I. THE NEED FOR OBJECTIVE MEASUREMENT

The Definition of Variables

Work in the behavioral sciences has been hampered by the notion that
"measurement" has a different meaning for them than for the physical sciences.
But it is fundamental in all scientific investigation to abstract from experience
simple ideas which organize the complexity in useful ways. Useful ideas,
often called "variables," are drawn from the scientist's careful observations
of his experience, but they are necessarily over-simplifications intended to be
meaningful for a particular purpose. Another scientist with other purposes may
construct a different set of variables to summarize similar experiences. Ideas come to be
generally regarded as "true" only when (and so long as) they are useful in predicting
outcomes among an interesting class of possible events.

After supposing his variable, the scientist attempts to establish its definition by
collecting, validating and calibrating observations that provide information about it.
Once the observations with which to measure a variable have been specified and
calibrated, the scientist has established an operational definition of that variable.
He can then proceed in an orderly way to study the formulation of general principles
about the processes involved and to predict the outcomes of other situations
involving these processes.

The  Measurement  of  Variables 

Even carefully defined observations are of little interest in themselves. They are
seldom chosen only for their own sake but rather for the information they contain about
the "variable" which is supposed to lie behind them. In order to extract this
information we must attempt to specify explicitly the supposed relationship between
observation and variable. It is the specification of this relationship that enables
us to make inferences about the amount of the variable that each object possesses
and so to make comparisons among objects based on the inferred variable.

The intent of this approach is to become free from the particular observations
taken. If the observations are appropriate and the inferences correctly drawn, we
want to need to know nothing else about them. We want to be able to make whatever
comparisons we choose, among objects or among different occasions for the same object,
regardless of which observations were made in each instance. Even though some
observations are necessary to infer the amount of the variable present, once that is
done, we want to be no longer bound to them.

These ideas can be illustrated with a simple example. A person
entering a room might observe that he looks up to
some people standing in the room and down to others. This might lead him to
hypothesize the existence of a "height" variable. He might then decide to carry
with him a stick with marks at various distances from the end and to observe for each
person the number of marks exceeded. This would permit him to make judgements
about the amount of height each person possesses that are more precise than "taller
than me" or "shorter than me".

If the man developed a means for translating the number of marks passed into the height
of the person, i.e., a model, it would be possible for him to compare any person's measure
(i.e., the height inferred by the model from the number of marks passed) with any
other measure obtained from any other stick that has been connected to the same variable.
The sticks need not be the same length nor have the same number of marks nor have
the marks at the same intervals, so long as each has been properly connected to the
variable "height".
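The stick example can be sketched in a few lines of Python. This is only an illustration under assumed conventions: each stick is described by the heights of its marks in centimetres, a mark counts as passed when the person reaches or exceeds it, and the function names and mark positions are hypothetical.

```python
from bisect import bisect_right

def marks_passed(height_cm, marks_cm):
    """Observation: number of marks on this stick that the person reaches or exceeds."""
    return bisect_right(sorted(marks_cm), height_cm)

def inferred_height(count, marks_cm):
    """Model: translate a count of marks passed into a height interval (cm)."""
    marks = sorted(marks_cm)
    low = marks[count - 1] if count > 0 else float("-inf")
    high = marks[count] if count < len(marks) else float("inf")
    return low, high

# Two hypothetical sticks with different numbers and spacings of marks.
stick_a = [150, 160, 170, 180, 190]
stick_b = [155, 165, 172, 175, 178, 185]

person = 176.0  # true height, unknown to the observer
count_a = marks_passed(person, stick_a)
count_b = marks_passed(person, stick_b)
interval_a = inferred_height(count_a, stick_a)
interval_b = inferred_height(count_b, stick_b)
```

The raw counts differ from stick to stick, but once each count is translated through its own stick's calibration the two inferred intervals overlap and can be compared directly.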

In  addition  to  freeing  him  from  the  necessity  of  always  using  the  same  (or 
identical) sticks, the model relating the observation to the variable must also provide
him  with  the  means  for  assessing  the  validity  of  the  measurement.  If  a  person  is 
measured  twice  and  the  two  measures  are  not  the  same,  within  statistical  limits  due 
to  the  precision  of  the  instruments,  he  would  conclude  that  the  person  has  not  been 
properly  measured  and  without  additional  information  would  be  at  a  loss  to  know  which 
of  the  measures,  if  either,  should  be  associated  with  the  person. 

Because  the  measurement  permits  the  comparison  of  every  new  measure  with  all 
previous  measures  for  the  person,  with  a  little  experience,  our  observer  could  come 
to  recognize  characteristics  of  sticks  and  persons  which  lead  to  measures  that  persist 
from  trial  to  trial.  The  measurement  model  is  essential  in  this  process  because  it 
provides  a  framework  for  recognising  when  an  observation  is  surprising.  If  we  know 
a person once passed, say, 117 marks on some stick and we now observe that he passes
37 marks on another, we cannot tell if this is a surprising result unless both observations
can be connected to the same fundamental variable. By knowing when to be
alarmed, an observer can quickly learn, for example, that flexible round-ended
sticks often give unpredicted results and that the height of people cannot be measured
reliably when they are running or jumping.

Height is so familiar that we feel we can observe it directly. But, in fact, we
cannot  "observe"  the  height  of  an  unfamiliar  object  when  it  is  viewed  in  complete 
isolation.  Like  all  other  variables,  our  observation  of  height  involves  a  series  of 
comparisons  of  the  unknown  object  with  some  available  calibrated  instrument. 

The units of measurement for height are equally familiar and arbitrary. They are
important and useful only because they have been defined and the definitions
accepted by everyone who measures height. The statement that a person is six feet
in height now specifies his height unambiguously with no further information required
about how the measure was obtained. This was not the case when the standard of
measure was the king's foot.

Psychological measurement is not different in principle from other kinds of
measurement but at this point there is little consensus about what variables are important
(i.e., useful, in general) and what units are convenient to measure them. The following
example should help clarify the parallels between physical and psychological
measurement.

An observer of military training might hypothesize the existence of a marksmanship
variable, that soldiers vary in the amount of this variable that they possess and
that they must possess a certain amount in order to be competent soldiers. (It should
be noted that this last hypothesis goes beyond measurement. The consideration of how
to determine the amount of the height variable that a person possesses did not involve
decisions about how much he should have. Only after obtaining a satisfactory
measure of the variable can we begin to investigate the relationship with other
variables to establish what amounts of height or marksmanship are required for particular
situations.)

One plan for studying marksmanship would be to follow each soldier through
his career and observe when his level of marksmanship was adequate and when it
was not. While at the end we would know a great deal about those particular soldiers,
we would not be able to make comparisons among them, since it is unlikely that we
would have comparable data for any two. It would be equally impossible to predict
their success in any new situations with any degree of precision.

We would prefer to structure the situation so that observations relevant to
marksmanship can be accumulated quickly, efficiently and economically. We might
decide that useful observations could be generated from the task of firing at a target
on a practice range. While this obviously does not involve all factors that might be
considered, it could be argued that it does contain an important element that is common
to any situation for which marksmanship would be involved. Knowledge of the
variable defined by the observation of firing at a target should enable us to make
reasonable predictions about the outcome of more complex situations.

But  the  number  of  times  the  person  succeeds  in  hitting  the  target  is  no  more  the 
measure  of  his  marksmanship  than  is  the  number  of  marks  passed  on  a  stick  a  measure 
of his height. The number of hits will depend on the size, distance, etc. of the target
(i.e., its difficulty) as well as the person's skill. We require a model to remove the
effect  of  target  difficulty  and  to  translate  the  observation  into  a  measurement  about  the 
person.  With  this  accomplished  we  no  longer  need  worry  about  presenting  identical 
targets  to  every  person  any  more  than  we  need  to  measure  their  height  with  identical 
rulers.  All  we  need  are  calibrated  targets. 
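A minimal sketch of what "removing the effect of target difficulty" could look like follows. It assumes a logistic form in which the log odds of a hit equal skill minus target difficulty; this anticipates the kind of model developed later in the text and is an assumption at this point in the argument. The function names and numerical values are hypothetical, and expected hit rates are used in place of observed ones to keep the illustration free of sampling noise.

```python
import math

def hit_probability(skill, difficulty):
    """Assumed model: log odds of a hit = skill - target difficulty."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty)))

def measure_from_hit_rate(hit_rate, difficulty):
    """Invert the assumed model: recover skill from a hit rate on a calibrated target."""
    return difficulty + math.log(hit_rate / (1.0 - hit_rate))

skill = 1.0                            # the marksman's (unknown) skill
easy_target, hard_target = -1.0, 2.0   # hypothetical calibrated difficulties

rate_easy = hit_probability(skill, easy_target)
rate_hard = hit_probability(skill, hard_target)

measure_easy = measure_from_hit_rate(rate_easy, easy_target)
measure_hard = measure_from_hit_rate(rate_hard, hard_target)
```

The hit rates on the two targets are very different, yet both translate to the same measure once each target's calibration is taken into account.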

Selection of the task and the measurement model are crucial. There is no reason
to expect that every observation can be converted into the measure of the variable we
want or that every mathematical function that transforms discrete counts into continuous
"variables" will be equally useful. In order to understand what is required of these, we
need to develop more fully what it is reasonable to require of a "measurement."

The  Requirements  for  Good  Measurement 

At  the  very  least  a  good  measurement  model  should  require  that  a  valid  test 
satisfy  the  following  conditions: 

1. A more able person always has a better chance of success
on  an  item  than  does  a  less  able  person. 

2.  Any  person  has  a  better  chance  of  success  on  an  easy  item 
than  on  a  difficult  one. 

It  follows  from  these  conditions  that  the  likelihood  of  a  person  succeeding  on  an  item 
is the consequence of the person's position on the variable (his ability) and the item's
position  on  the  same  variable  (its  difficulty)  and  that  no  other  variables  influence  the 
outcome.  This  implies  that  the  difficulty  of  an  item  is  an  inherent  property  of  that 
item  which  adheres  to  it  under  all  relevant  circumstances  without  reference  to  any 
particular  population  of  persons  to  whom  the  item  might  be  administered. 

A major consequence of these conditions is that it is possible to derive an
estimator for each parameter that is independent of all other parameters. All information
about a person's ability expressed in his responses to a set of items is contained in
the simple unweighted count of the number of items which he answered correctly. Raw
score is a sufficient statistic for ability. For item difficulty, the sufficient statistic is
the number of persons in the sample who responded correctly to that item.
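The two sufficient statistics are simply the row and column totals of the person-by-item response matrix. The small matrix below is invented for illustration (rows are persons, columns are items, 1 is a correct response).

```python
# Hypothetical responses: 4 persons (rows) by 4 items (columns).
responses = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]

# Sufficient statistic for each person's ability: unweighted count of correct answers.
raw_scores = [sum(row) for row in responses]

# Sufficient statistic for each item's difficulty: number of persons answering correctly.
item_counts = [sum(column) for column in zip(*responses)]
```

The first two persons answered different items correctly but share a raw score of 2, so under these requirements they would receive the same ability estimate.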

These common sense requirements enable us to formulate an explicit mathematical
model and to use this model to assess the appropriateness of the observations
for furnishing information about the variable we are seeking. These requirements are
also  deceptively  demanding.  Successful  measurement  depends  on  achieving  sufficient 
control  with  respect  to  the  observations  taken,  so  that  their  variations  differ  only 
along  a  single  variable.  Even  though  persons  differ  in  many  ways,  their  measurement 
becomes  possible  only  when  one  of  these  dimensions  dominates  the  behavior  prompted 
by  the  items  administered.  Even  when  items  differ  on  a  number  of  factors,  they  can 
be  successfully  used  for  measurement  if  the  responses  of  persons  can  be  dominated  by 
only  one  of  these  factors.  Thus  measurement  can  succeed  despite  multidimensionality, 
when  the  multidimensionality  is  controlled  so  as  not  to  be  shared  actively  by  both 
persons and items. Two examples should help make this clear.

Two types of items: Suppose we wish to measure "general mental ability" and to do
this  devise  an  instrument  containing  both  reading  comprehension  and  mathematical 
computation  items.  While  this  instrument  is  clearly  two  dimensional,  measurement 
with  it  could  succeed  in  situations  where  either 

1.  there  is  no  person  variable  which  affects  the  probability 
of  success  for  the  reading  items  differently  than  for  the 
math  items, 

2.  math  ability  and  reading  ability  are  so  highly  correlated 
in  the  population  that  they  do  not  appear  different. 

In either case we should not care whether measurements were made entirely with reading
items, entirely with math items, or any mixture in between, since all items measure the
"same" variable. In the first case, there is only one variable (perhaps called "general
mental ability"). In the second, there are two but since they are so highly correlated
a  person  high  on  one  is  high  on  the  other.  We  can  measure  math  ability  with  reading 
items  and  reading  ability  with  math  items,  if  we  choose.  It  does  not  matter  whether 
we  call  the  resulting  measure  math,  reading  or  general  ability.  However,  if  we  try 
to  assert  that  both  types  of  items  are  necessary  for  a  "fair"  measurement  and  become 
involved in setting the correct proportion of each, we have admitted the
multidimensionality of the situation and must instead measure the two variables separately
with  items  appropriate  to  each.  (If  we  are  still  interested  in  one  number  we  could 
then  argue  about  how  the  two  measures  could  be  combined  into  a  single  index.) 

It  is  only  possible  to  measure  a  person,  who  always  has  many  different  abilities, 
on one variable by carefully constructing an instrument which addresses just that one
variable.  We  may  sometimes  get  by  with  a  multidimensional  instrument,  since  the 
two  alternatives  above — one  variable  versus  two  highly  correlated  variables— are  not 
distinguishable  in  data,  but,  when  we  use  an  instrument  of  items  readily  classifiable 
into  two  or  more  types,  its  (effective)  unidimensionality  must  be  corroborated  with 
each  new  sample. 

One  type  of  item  with  extraneous  variables:  A  contrasting  case  can  be  illustrated  by 
considering  the  measurement  of  problem  solving  ability  with  an  instrument  composed 
of  word  problems.  Proficiency  on  this  instrument  requires  many  abilities  in  addition 
to  problem  solving,  not  the  least  of  which  is  the  ability  to  read  the  language  in  which 
the problems are written. If the reading ability of every person is well above the
readability of the problems, differences among items or persons in this respect will not
affect performance on the instrument. However, if any person has difficulty reading
the problems, his measure of problem solving ability will be biased downward by this
interfering factor. His probability of success will be influenced by the interaction
between his reading ability and the readability of the problem. This can, of course,
be eliminated by regulating readability to be well below the reading ability of the
target population. Then, although the persons may still vary in their ability to read,
variation  in  their  scores  on  this  instrument  will  be  due  to  their  variation  in  problem 
solving  ability  alone. 

This  case  differs  from  the  preceding  one  in  that  the  items  are  of  one  type  but 
each  has  a  "difficulty"  on  two  variables.  As  long  as  all  persons  are  sufficiently  able 
readers,  the  instrument  can  be  used  to  measure  problem  solving  ability.  Theoretically, 
such  an  instrument  could  also  be  used  to  measure  reading  ability  among  very  able 
problem  solvers  who  were  poor  readers. 

Random  guessing  on  multiple-choice  items  is  another  instance  of  extraneous 
variation. Persons who are guessing succeed on difficult items more often than their
abilities would predict. This makes them appear more able when more difficult items
are administered, since their frequency of success does not decrease as difficulty
increases. A similar but opposite effect occurs when able persons become careless with
easy items, making them appear less able than they are.

Such  items  "measure"  two  variables — the  ability  of  interest  and  the  tendency  to 
guess  or  to  become  careless.  The  "guessingness"  of  the  item  may  or  may  not  be  a  simple 
function of the difficulty on the main variable but for the person two different variables
are involved. The measure of either variable is threatened by the presence of
the other.
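The effect of guessing on apparent ability can be illustrated numerically. The sketch below assumes, purely for illustration, a logistic success curve with a fixed guessing floor added to it; this form, the guessing rate of 0.25, and all names are assumptions and are not part of the measurement model discussed in the text.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_with_guessing(ability, difficulty, guess_rate):
    """Assumed success probability when failures are sometimes rescued by guessing."""
    return guess_rate + (1.0 - guess_rate) * logistic(ability - difficulty)

def apparent_ability(p_success, difficulty):
    """Ability implied by a success rate if guessing is (wrongly) ignored."""
    return difficulty + math.log(p_success / (1.0 - p_success))

ability = -1.0      # true ability of a low-ability person
guess_rate = 0.25   # e.g. four-choice items, hypothetical
difficulties = [0.0, 1.0, 2.0, 3.0]

apparent = [
    apparent_ability(p_with_guessing(ability, d, guess_rate), d)
    for d in difficulties
]
```

Because the success rate never falls below the guessing floor, the ability implied by each item rises with the item's difficulty, so different subsets of items would give different measures for the same person.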

These forms of multidimensionality have in common the attribute that different
subsets of the items produce non-equivalent estimates of person ability and different
subsamples of persons produce different estimates of item difficulties. This
contradicts the requirements for good measurement specified in the Rasch model.

Unequal  Item  Discriminations:  No  discussion  of  disturbances  in  measurement  is  complete 
without mention of item "discrimination." Rasch's derivation of what is required in
order to achieve objectivity (i.e., measures of person ability that are free of the sets
of items administered, and calibrations of items that are free of the samples of persons
used) leads to a model which rules out a parameter for item discrimination. If measurement
objectivity is to be achieved, the situation must be arranged so that a parameter
for discrimination is not necessary.

When  the  problem  is  approached  from  other  perspectives,  for  example,  when  the 
observations  are  considered  so  valuable  that  the  data  are  allowed  to  determine  the  form 
of  the  model,  regardless  of  the  effect  on  measurement,  item  discrimination  is  almost 
always  included  as  a  parameter.  A  model  with  a  discrimination  parameter  (or  any 
other  additional  parameter)  will  recover  the  observed  data  more  completely  than  one 
without,  but  it  is  not  at  all  clear  when  that  is  done  what  bearing  the  resulting  "estimates" 
of discrimination can have on the generalizability and reproducibility of the situation.
It remains to be settled whether discrimination "estimates" pertain to stable, meaningful
parameters that are useful in characterizing future outcomes of similar situations or
whether they are only temporarily useful as descriptive statistics for diagnosing trouble
in one set of observations.

In theory, item discrimination is a measure of the amount of information an
item contains about the quantity of the variable that a person possesses. In practice,
it is better described as an index of the correlation over the sample of the item score
with the operationally defined variable. These correlations can be "high" or "low"
for the wrong reasons.

With  the  problem  solving  example,  if  the  items  vary  in  readability  and  their 
readability  is  near  enough  to  the  reading  level  of  the  persons  so  that  some  persons  have 
difficulty  reading  some  items,  then  these  items  will  appear  to  vary  in  their  power  to 
discriminate along the problem solving scale, due to their connection to readability.

If  the  calibration  sample  is  drawn  from  one  population,  items  which  no  one  is 
able  to  read  will  have  no  relationship  to  problem  solving  ability  and  items  which 
everyone  reads  without  difficulty  will  be  the  purest  instances  of  the  relevant  behavior. 
Hence,  the  highest  "discriminations"  will  be  associated  with  items  dominated  by  the 
variable  of  interest  and  the  lowest  will  be  for  items  most  influenced  by  other  factors. 

However,  if  the  calibration  group  consists  of  samples  from  two  populations  which 
have  identical  distributions  of  problem  solving  ability  but  differ  in  their  reading  ability, 
the group with the better readers will tend to score higher on the test. Then the items
which  are  the  most  effective  at  separating  the  high  and  low  scorers  will  be  the  items 
most  influenced  by  readability.  Therefore,  in  this  instance,  the  items  with  the  highest 
apparent  discriminations  will  not  be  the  ones  with  the  strongest  relationship  to  the 
variable  of  interest  but  rather  the  ones  with  the  strongest  relationship  to  the  dimension 
along  which  the  populations  differ  most,  namely  readability. 

Models  which  include  an  item  discrimination  parameter  will  appear  to  "explain" 
data from either of these situations. Both, however, violate the unidimensionality
assumptions employed by these other models, as well as by the Rasch model. Therefore,
we are in the unfortunate position of having data which seem to fit the model
although they do not comply with the requirements of unidimensionality. The Rasch
model avoids this potential danger by uncovering unacceptable variation in discrimination
and avoiding discrimination as an item parameter.

A situation for which it is sometimes argued that a discrimination parameter is
legitimate is one in which the items vary in the amount of random fluctuation inherent
in them. This is analogous to items differing in their factor loadings. But even in this
case the requirements for good measurement given previously are not satisfied.
Before we sacrifice these requirements, we should consider what it means for items to differ in their
inherent error and decide what it is reasonable to do about it.

It is difficult to construct examples of items that vary in information which cannot
be explained by the presence of additional factors. One possible case might be
an instrument containing both multiple-choice and completion items. They could both
reflect the same variable but, since different behaviors are required, they might differ
in their relationship to the variable. We might expect a completion item, which
requires the person to recall and write in the correct answer, to discriminate more
sharply than an item which only requires the person to recognise the answer. Recognition
items give the person who does not recall or recognise the correct response the
opportunity to eliminate responses he knows to be incorrect, thereby increasing his
chances of choosing the correct one. If his success at this is related to his position on
the latent variable, not to his "test-wiseness" or any other extraneous factor, intelligent
guessing of this sort can provide information about the variable of interest.

But this problematic situation is easily avoidable by simply not mixing items
requiring obviously different behaviors on the same elementary instrument. The
influence of extraneous factors on the outcome is a problem for all response models.
The Rasch model is less susceptible to this source of confusion, since it is not so
readily adaptable to mixed influences. Previous research (Panchapakesan, 1969;
Wright and Panchapakesan, 1969) indicated that the tests of fit for the Rasch model
are sensitive enough to such disturbances to protect measurements from deterioration
due to them.

The  Rasch  Logistic  Response  Model 

Georg Rasch (1960) provided a rethinking of the measurement problem which
overcomes  most  of  the  deficiencies  of  traditional  analysis  and  avoids  the  theoretical 
complications  of  the  other  latent  trait  models.  Rasch's  stochastic  response  model 
describes  the  probability  of  a  successful  outcome  of  a  person  on  an  item  as  a  function 
of  only  the  person's  ability  and  the  item's  difficulty.  Using  only  the  traditional 
requirement  that  a  measurement  be  based  on  a  set  of  homogeneous  items  monotonically 
related  to  the  trait  to  be  measured,  Rasch  derived  his  measurement  model  in  the  form 
of a simple logistic expression and demonstrated that in this form the item and person
parameters  are  statistically  separable.  Andersen  (1973a)  elaborated  and  refined  the 
mathematical  basis  for  the  model.  Wright  and  Panchapakesan  (1969)  developed  practical 
estimation  procedures  that  made  application  of  the  model  feasible. 

Rasch's model, while based on the same requirement of the sufficiency of total
score relied on by traditional methods, offers new and promising opportunities for
advancing our understanding of measurement and departures from it. Since the parameters
of the model are separable, it is possible to derive estimators for each parameter
independently of the others. The logistic transformation assigns an ability of minus
infinity to a score of zero and plus infinity to a score of one hundred percent. This
eliminates the bounds on the ability range and puts the standard errors of measurement
into a reasonable relationship with the information provided by observed score. The
tests of item fit which are the basis for item selection are sensitive to high discriminations
as well as to low and so lead to the selection of those items which form a consistent
definition of the trait and to the rejection of exceptional items. Finally, the explicitness
of the mathematical expression of the model facilitates statistical statements about
the significance of individual person-item interactions and so makes both a very general
and a very detailed analysis of misfit possible.
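The behavior of the logistic transformation at the extremes can be seen in a short sketch. It uses the plain log odds of the proportion correct as a stand-in for the ability scale; this is a simplifying assumption for illustration, not the estimation procedure itself, and the function name is hypothetical.

```python
import math

def score_to_logit(score, n_items):
    """Log odds of the proportion correct; unbounded at zero and perfect scores."""
    if score <= 0:
        return float("-inf")
    if score >= n_items:
        return float("inf")
    proportion = score / n_items
    return math.log(proportion / (1.0 - proportion))

n_items = 10
logits = [score_to_logit(score, n_items) for score in range(n_items + 1)]

# One more correct answer is worth more logits near the extremes than in the middle.
step_middle = score_to_logit(6, n_items) - score_to_logit(5, n_items)
step_extreme = score_to_logit(9, n_items) - score_to_logit(8, n_items)
```

A score of zero or of one hundred percent carries no upper or lower bound on ability, and equal raw-score steps correspond to larger steps in logits as the score approaches either extreme.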

The  Rasch  model  provides  an  explicit  framework  for  comparing  observed  with 
expected  outcomes.  The  expected  outcome  of  administering  an  item  to  a  person  is  that 
predicted  by  the  model  assuming  that  the  item  is  appropriate  with  respect  to  that  person 
and  that  the  person  was  adequately  motivated  to  bring  his  full  ability  to  bear  on  the 
item.  The  model  permits  us  to  assess  the  likelihood  of  the  observed  result,  and  hence, 
to  make  statements  about  the  appropriateness  of  the  particular  item  for  the  particular 
person. 
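One way to compare an observed outcome with its expectation is a standardized residual. The sketch below assumes the logistic form with ability and difficulty expressed in log-odds units; the residual shown is a common choice offered only as an illustration, and the names and values are hypothetical.

```python
import math

def p_success(ability, difficulty):
    """Expected outcome under the assumed logistic model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def standardized_residual(observed, ability, difficulty):
    """(observed - expected) divided by the binomial standard deviation."""
    p = p_success(ability, difficulty)
    return (observed - p) / math.sqrt(p * (1.0 - p))

# An able person (2.0) meets an easy item (-1.0).
residual_if_fail = standardized_residual(0, 2.0, -1.0)
residual_if_pass = standardized_residual(1, 2.0, -1.0)
```

A failure by an able person on an easy item yields a large negative residual, flagging a surprising observation, while a success on the same item yields a residual close to zero.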

Objective  measurement  eliminates  many  of  the  problems  that  have  plagued  test 
users. The Rasch model is both necessary and sufficient for objectivity in measurement.
To best utilize the power of this model, we need to develop fully the concepts and
mathematics related to it. Chapter II provides this development. The first section
reviews the philosophy and concepts presented by Rasch and his students. Section
two derives the estimating equations for the Bernoulli form (i.e., one trial per task)
and then generalizes to the Binomial form (several trials per task). Finally, goodness
of fit tests are presented for assessing the adequacy of the calibration.


CHAPTER II


DEVELOPMENT  OF  THE  RASCH  MODEL 

Rasch's  development  of  his  approach  to  measurement  places  central  emphasis 
on the concept of "specific objectivity" (Rasch, 1960, 1961, 1967, 1968; Wright,
1968). The problem of measurement is to make comparisons among two or more persons
(more generally, "objects") or two or more items ("agents") using the information from
the interaction of the objects with the agents. In psychometrics, we often begin by
determining  the  characteristics  of  items  based  on  an  administration  to  a  sample  of  people 
but  our  ultimate  aim  is  to  compare  the  performance  of  people  on  a  set  of  items. 

By  "specific  objectivity",  Rasch  means  a  comparison  of  any  two  persons,  derived 
from  a  set  of  person-item  interactions,  which  is  independent  of  all  item  parameters  and 
of  all  person  parameters  other  than  the  two  in  question.  Similarly,  a  statement  about 
two  items  is  independent  of  all  person  parameters  and  all  other  item  parameters. 

While such a property is highly desirable (Loevinger, 1947), it is not a natural
consequence of person-item interactions but must be specifically built into the
measurement process for every situation. The more natural circumstance is for every person to
bring many abilities into every confrontation with an item. Unless the item is carefully
constructed to tap only one of these abilities, the process will be governed by any
number of person or item characteristics.

Rasch (1960, 1961) based his development of a measurement model on the following
assumptions:

(a) the probability of a correct response to item i by
person v is entirely governed by

$$\pi_{vi} = \frac{\theta_{vi}}{1 + \theta_{vi}}$$

(b) in which the situational parameter $\theta_{vi}$ is the product of two factors

$$\theta_{vi} = \xi_v \, \varepsilon_i$$

where $\xi_v$ pertains to the person and $\varepsilon_i$ to the item, and

(c) all answers, given the parameters, are stochastically independent.

It is clear from (a) that $\theta_{vi}$ represents the odds of success and from (b) that $\xi_v$ is
the ability of person v and $\varepsilon_i$ is the easiness of item i.

The  separability  (also  called  "latent  additivity")  of  the  parameters  shown  in  (b) 
makes  possible  objectivity  in  measurement.  It  follows  from  this  that  all  information 
about  a  person's  ability  contained  in  his  responses  to  a  set  of  items  is  captured  by  the 
simple  count  of  correct  responses.  This  permits  us  to  compare  the  abilities  of  two 
persons  independently  of  the  items  administered. 

Following Rasch (1960), the logarithm of the odds of success on item i by person v
is:

(1)  $\log \frac{P_{vi}}{1 - P_{vi}} = \log(\xi_v \varepsilon_i) = \log \xi_v + \log \varepsilon_i .$

Therefore the abilities of person v and person u, when observed on any item i, can be
compared in logistic units by subtraction:

(2)  $\log \frac{P_{vi}}{1 - P_{vi}} - \log \frac{P_{ui}}{1 - P_{ui}} = \log \xi_v + \log \varepsilon_i - \log \xi_u - \log \varepsilon_i = \log \xi_v - \log \xi_u ,$

which does not involve the item parameter at all. Actually computing a number to
estimate this difference requires us to make use of the sufficiency of total score. Since
all the information about ability is contained in the number of correct responses, all
persons who have the same score must be assigned the same estimated ability. Therefore,
by grouping together the responses of all persons who scored r, we can obtain an estimate
of $P_{vi}$ for all v with scores of r:

(3)  $P_{ri} = \frac{x_{ri}}{n_r}$

where $P_{ri}$ is the probability of success on item i by persons with score r,
$x_{ri}$ is the number of persons with score r who answered item i correctly, and
$n_r$ is the number of persons with score r.

And so

(4)  $\log \frac{x_{ri}}{n_r - x_{ri}} - \log \frac{x_{si}}{n_s - x_{si}}$

is the difference in ability between a person with score r and a person with score s,
estimated with item i. Of course information is usually available from more than one
item; statistical techniques which amalgamate the information from all into a single
estimate are presented in the next section.

Since all parameters always appear in combination with at least one other parameter,
there is an indeterminacy in the system that must be resolved before a particular estimate
of a person's ability or an item's difficulty can be calculated. This can be done in many
ways; a simple one is to select one item, say item 1, as the reference point and set its
log easiness to zero. This arbitrary choice does not affect the comparison of two
persons in expression (2) but it makes it possible to compute a particular estimated
ability for each person. From expression (1) the estimated ability, $\log \xi_r$, for score r
is now equal to the log odds of success on item 1 for persons with score r:

(5)  $\log \xi_r = \log \frac{x_{r1}/n_r}{1 - x_{r1}/n_r} = \log \frac{x_{r1}}{n_r - x_{r1}} .$

Similarly, values can be computed for all items other than item 1 by

(6)  $\Delta_{ri} = \log \xi_r + \log \varepsilon_i - \log \xi_r - \log \varepsilon_1 = \log \varepsilon_i$

where $\Delta_{ri}$ is the difference between the logit for item i in score group r and the logit
for item 1 in score group r, or

     $\Delta_{ri} = \log \frac{x_{ri}}{n_r - x_{ri}} - \log \frac{x_{r1}}{n_r - x_{r1}} .$

Once difficulties have been estimated in this fashion, we are able to compare two people,
as in (2) above, who did not take the same item.
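The score-group calculations in expressions (3) through (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts are invented for a hypothetical three-item test, and the names `logit`, `n`, `x`, `ability`, `log_easiness` and `difference` are not from the original text.

```python
import math

def logit(successes, group_size):
    """Log odds of success in a score group: log[x / (n - x)]."""
    if not 0 < successes < group_size:
        raise ValueError("log odds undefined for zero or all successes")
    return math.log(successes / (group_size - successes))

# Hypothetical counts for a 3-item test: n[r] persons obtained score r,
# and x[r][i] of them answered item i+1 correctly (so sum(x[r]) == r * n[r]).
n = {1: 40, 2: 50}
x = {1: [20, 12, 8], 2: [40, 35, 25]}

# Expression (5): with item 1 as the reference (log easiness zero),
# the ability for score r is the logit on item 1 in that score group.
ability = {r: logit(x[r][0], n[r]) for r in n}

# Expression (6): log easiness of each item relative to item 1, per score group.
log_easiness = {r: [logit(xi, n[r]) - ability[r] for xi in x[r]] for r in n}

# Expression (4): difference in ability between scores 2 and 1, using item 1.
difference = ability[2] - ability[1]
```

Each score group yields its own value of the relative log easiness; under the model these values should agree within sampling error, which is the idea later exploited by the tests of fit.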

Andersen  (1973a)  provides  the  proof  for  the  other  side  of  specific  objectivity.  He 
shows  that  if  raw  score  is  taken  as  the  sufficient  statistic  for  ability,  then  the  underlying 
model  must  be  the  Rasch  model.  It  follows  from  this  that  the  three  assumptions  given 
above  are  both  necessary  and  sufficient  for  specific  objectivity. 

Wright (1968) introduced the terms "sample-free item calibration" and "test-free
person measurement." This is not intended to imply that anything can be known about
a person's ability without administering some items or that anything can be known about
an item's difficulty without giving it to some persons. It does mean, as illustrated above,
that we can obtain estimates of ability that do not depend on the difficulties of the
particular set of items we choose to administer. Any other set of appropriate items
would produce a statistically equivalent estimate for the person.

This avoids some troublesome problems for the test user. It solves the problem of
form equating. Once a bank of items has been calibrated (i.e., the difficulty of each
item estimated), any form made up of items from that bank has also been calibrated. Its
ability estimates are on the common scale of measurement with no further manipulation.
This was dramatically illustrated by Rentz and Bashaw (1975) who showed substantial
savings in time and money through the use of Rasch techniques over traditional methods
of form equating. The logical extension of this property suggests that each person can
be administered a test tailored specifically for him and still measures can be obtained
that are comparable for all (Wright, 1968; Wright and Douglas, 1975a).

Before presenting a discussion of some of the methods available for obtaining estimates
of the model's parameters, we should mention that ability and difficulty will be expressed
throughout in "logits" which are arbitrary units of measurement. A person's ability in
logits is the natural log odds in favor of his succeeding on an item whose difficulty is
at the origin on the scale. In other words, a person with ability 0.0 (i.e., ability
equal to the difficulty of an item at the origin) has an even chance (odds 1 to 1) of
succeeding on the item since log(1) = 0 or equivalently, from expression (1),

(7)  $\log \frac{P_{vi}}{1 - P_{vi}} = 0$  or  $\frac{P_{vi}}{1 - P_{vi}} = 1.0 .$

Similarly, a person with ability 1.0 has about a three to one chance of success (log 3
is approximately equal to 1.0). Logits are used because they are computationally
convenient. In connection with this, we will use the notation:

    $\beta_v = \log \xi_v$ = logit ability for person v
    $-\delta_i = \log \varepsilon_i$, so that $\delta_i$ = logit difficulty for item i.
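As a small numerical illustration of the logit scale, the following sketch converts a logit ability and difficulty into odds and a probability of success. The function names `odds` and `success_probability` are illustrative choices, not notation from the text.

```python
import math

def odds(beta, delta=0.0):
    """Odds of success for logit ability beta on an item of logit difficulty delta."""
    return math.exp(beta - delta)

def success_probability(beta, delta=0.0):
    """Probability of success implied by the odds above: odds / (1 + odds)."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

# Ability equal to the item difficulty gives even odds.
even = success_probability(0.0)

# One logit above the item gives odds of e (about 2.72 to 1),
# which is the "about three to one" mentioned in the text.
one_logit_odds = odds(1.0)
```

Only the difference between ability and difficulty enters either function, which is why the origin of the logit scale can be chosen arbitrarily.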


Calibration  of  Item  Parameters 

Several methods of estimating item parameters are treated in detail in Douglas
(1975) and Wright and Douglas (1975b, 1976, 1977a, 1977b). They are reviewed below.


Conditional Maximum Likelihood Estimation: The mathematically ideal method is the
conditional maximum likelihood approach, which follows naturally from the separability
of parameters. The estimates were derived in detail by Andersen (1973c). An
approximation was developed by Wright (1966). Andersen's derivation begins with the Rasch
model:

     $P\{X_{vi} \mid \beta_v, \delta_i\} = \frac{\exp[X_{vi}(\beta_v - \delta_i)]}{1 + \exp(\beta_v - \delta_i)}, \quad X_{vi} = 0, 1; \quad i = 1, L; \quad v = 1, N.$

If this model fully characterizes the interaction between person v and any item i, the
likelihood of a particular set of responses to L items, denoted by $(X_{vi})$, is

(10)  $P\{(X_{vi}) \mid \beta_v, (\delta_i)\} = \prod_i \frac{\exp[X_{vi}(\beta_v - \delta_i)]}{1 + \exp(\beta_v - \delta_i)} = \frac{\exp(r_v \beta_v) \exp(-\sum_i X_{vi}\delta_i)}{\prod_i [1 + \exp(\beta_v - \delta_i)]} .$

This probability is seen to be composed of three parts: $\exp(r_v \beta_v)$, which connects the
person's score and his ability; $\exp(-\sum_i X_{vi}\delta_i)$, which connects the data and the item
parameters; and the denominator, which involves no data.

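The probability of observing a given raw score

     $r_v = \sum_i^L X_{vi}$

is the sum of the probabilities of all possible ways of obtaining the score r. That is,

(12)  $P\{\sum_i X_{vi} = r \mid \beta_v, (\delta_i)\} = \sum \frac{\exp(\beta_v \sum_i X_{vi}) \exp(-\sum_i X_{vi}\delta_i)}{\prod_i [1 + \exp(\beta_v - \delta_i)]}$  for all $\sum_i X_{vi} = r$

(13)  $P\{\sum_i X_{vi} = r \mid \beta_v, (\delta_i)\} = \frac{\exp(r\beta_v)\, \gamma_r}{\prod_i [1 + \exp(\beta_v - \delta_i)]}$

where $\gamma_r$ is an elementary symmetric function of the
item difficulties which equals

     $\gamma_r = \sum [\exp(-\sum_i X_{vi}\delta_i)]$  for all $\sum_i X_{vi} = r$

and the summation is over all possible response vectors which sum to r.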

The conditional probability of response vector $(x_{vi})$ given the raw score is found
by dividing (10) by (13):

(14)  $P\{(x_{vi}) \mid r_v, (\delta_i)\} = \frac{\exp(-\sum_i x_{vi}\delta_i)}{\gamma_{r_v}}$

which is an expression in the item parameters that is free of the ability distribution of
the persons. This result depends on raw score being a sufficient statistic for ability.
The conditional likelihood of the entire data matrix $((X_{vi}))$, consisting of the L
responses by each of the N persons, is:

(15)  $\Lambda = P\{((X_{vi})) \mid (r_v), (\delta_i)\} = \prod_v^N \frac{\exp(-\sum_i x_{vi}\delta_i)}{\gamma_{r_v}}$

since the observations are stochastically independent given the parameters or their
sufficient statistics. Letting $s_i = \sum_v^N X_{vi}$ and $\prod_r^{L-1} \gamma_r^{n_r} = \prod_v^N \gamma_{r_v}$, we have for the
conditional likelihood:

(16)  $\Lambda = \frac{\exp(-\sum_i s_i \delta_i)}{\prod_r^{L-1} \gamma_r^{n_r}} .$

Estimators for the $(\delta_i)$ are found by maximizing $\Lambda$ in the usual way. Details of this and
the iterative procedures necessary for obtaining estimates are given by Andersen (1972),
Douglas (1975) and Wright and Douglas (1975b).
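To make the elementary symmetric functions and the conditional probability in (14) concrete, here is a minimal Python sketch. It uses a simple one-item-at-a-time summation recursion, which is a common textbook device and not necessarily the method used by the authors cited above; the function names and the example difficulties are assumptions made for illustration.

```python
import math
from itertools import product

def elementary_symmetric(deltas):
    """Return [gamma_0, ..., gamma_L] for item difficulties deltas (logits).

    gamma_r is the sum, over all 0/1 response vectors with total r,
    of exp(-sum_i x_i * delta_i). Items are added one at a time.
    """
    gamma = [1.0] + [0.0] * len(deltas)
    for k, delta in enumerate(deltas, start=1):
        eps = math.exp(-delta)
        # Update from high r to low r so gamma[r - 1] is still the old value.
        for r in range(k, 0, -1):
            gamma[r] += eps * gamma[r - 1]
    return gamma

def conditional_probability(x, deltas):
    """Expression (14): probability of response vector x given its raw score."""
    gamma = elementary_symmetric(deltas)
    numerator = math.exp(-sum(xi * di for xi, di in zip(x, deltas)))
    return numerator / gamma[sum(x)]

# Hypothetical three-item test; the probabilities of all vectors
# with raw score 2 should sum to one, whatever the person's ability.
deltas = [-0.5, 0.0, 0.5]
total = sum(conditional_probability(x, deltas)
            for x in product((0, 1), repeat=3) if sum(x) == 2)
```

Note that no ability parameter appears anywhere in the sketch, which is the practical meaning of the statement that (14) is free of the ability distribution.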


Unconditional Maximum Likelihood Estimation: While formally correct, the conditional
estimation techniques have serious practical problems. The computation of the elementary
symmetric functions is quite expensive by the methods now used and incurs unacceptably
large roundoff errors for tests of length greater than twenty items. Wright developed a
less expensive technique using unconditional maximum likelihood which is reported in
detail in Wright and Panchapakesan (1969) and Wright and Douglas (1975b). In their
development, the unconditional likelihood of the data matrix is the double product of
$P_{vi}$ over all persons and items. Thus,

(17)  $\Lambda = \prod_i^L \prod_v^N P\{X_{vi} \mid \beta_v, \delta_i\} = \prod_i^L \prod_v^N \frac{\exp[X_{vi}(\beta_v - \delta_i)]}{1 + \exp(\beta_v - \delta_i)}$

or

(18)  $\Lambda = \frac{\exp(\sum_v r_v \beta_v - \sum_i s_i \delta_i)}{\prod_i \prod_v [1 + \exp(\beta_v - \delta_i)]} .$

Again the responses are stochastically independent given the parameters. (The high
correlations that are usually observed among a person's responses to a set of items are
due entirely to their common relationship to the person's ability, $\beta_v$, which the items
are attempting to measure.)

The algebra for maximizing this likelihood is less complex if we work with the log
likelihood:

(19)  $\lambda = \log \Lambda = \sum_v r_v \beta_v - \sum_i s_i \delta_i - \sum_v \sum_i \log[1 + \exp(\beta_v - \delta_i)] + \varphi \sum_i \delta_i .$

The final term is included to remove the indeterminacy in the equations
that arises because only differences between parameters are estimable.
The $\varphi$-term removes the problem here by imposing the restriction that $\sum_i \delta_i = 0$. While
almost any restriction on the $\delta_i$ would do, this particular one is convenient for reasons
to be discussed later.


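The derivatives needed to obtain the maxima of (19) are:

(20)  $\frac{\partial \lambda}{\partial \beta_v} = r_v - \sum_i \frac{\exp(\beta_v - \delta_i)}{1 + \exp(\beta_v - \delta_i)} = r_v - \sum_i P_{vi}$

(21)  $\frac{\partial \lambda}{\partial \delta_i} = -s_i + \sum_v \frac{\exp(\beta_v - \delta_i)}{1 + \exp(\beta_v - \delta_i)} + \varphi = -s_i + \sum_v P_{vi} + \varphi$

(22)  $\frac{\partial^2 \lambda}{\partial \beta_v^2} = -\sum_i \frac{\exp(\beta_v - \delta_i)}{[1 + \exp(\beta_v - \delta_i)]^2} = -\sum_i P_{vi}(1 - P_{vi}) < 0$

(23)  $\frac{\partial^2 \lambda}{\partial \delta_i^2} = -\sum_v \frac{\exp(\beta_v - \delta_i)}{[1 + \exp(\beta_v - \delta_i)]^2} = -\sum_v P_{vi}(1 - P_{vi}) < 0$

Wright and Douglas (1975b) demonstrated that the cross derivatives are small and can
be ignored without harming the resulting estimates.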

Since both second derivatives are always negative, there can only be one extreme
point and it must represent the maximum likelihood. This point can be determined by
setting equations (20) and (21) equal to zero and solving. We first need to evaluate $\varphi$.
Summing equation (21) over all items,

(24)  $-\sum_i s_i + \sum_i \sum_v P_{vi} + L\varphi = 0$

or

(25)  $-\sum_i \sum_v X_{vi} + \sum_i \sum_v P_{vi} + L\varphi = 0$

and since from (20)

(26)  $\sum_v \sum_i X_{vi} = \sum_v \sum_i P_{vi} ,$

we must have that $\varphi = 0$. The estimating equations are simply:

(27)  $s_i - \sum_r^{L-1} n_r P_{ri} = 0, \quad i = 1, L$

(28)  $r - \sum_i^L P_{ri} = 0, \quad r = 1, L-1$

where $P_{ri} = \exp(b_r - d_i)/[1 + \exp(b_r - d_i)]$ and $b_r$ and $d_i$ are the estimates of
ability and difficulty.

We are able to substitute an r subscript for the v subscript in (27) because r is a
sufficient statistic for ability, so persons who attain the same score are indistinguishable
as far as our knowledge of their ability is concerned. It is more efficient to perform the
summations from 1 to L-1 rather than 1 to N.

Since (27) and (28) can not be solved explicitly for $b_r$ and $d_i$, we must resort to an
iterative solution. The simplified Newton-Raphson approach given by Wright and
Panchapakesan (1969) works quite well for this:

(29)  $d_i^{j+1} = d_i^j - \frac{s_i - \sum_r n_r P_{ri}^j}{\sum_r n_r P_{ri}^j (1 - P_{ri}^j)}$

and

(30)  $b_r^{j+1} = b_r^j + \frac{r - \sum_i P_{ri}^j}{\sum_i P_{ri}^j (1 - P_{ri}^j)} .$

The meaning of these expressions can be grasped intuitively by noting that the
numerators of the correction terms (i.e., the right hand terms) are the estimating
equations (27) and (28). When this term is zero, the equation is solved and we no longer
need modify the estimates. If it is not zero, we adjust the estimate by an amount
proportional to this difference. The scaling factor in the denominator is the derivative
of the $P_{ri}$ with respect to the parameter, which gives the change in scale from score
units to logit units.

Starting values needed to begin the process can be obtained by computing the $d_i$
assuming the $b_r$ are zero and, analogously, the $b_r$ assuming the $d_i$ are zero. From (27)
we have

(31)  $s_i = N \frac{\exp(-d_i^0)}{1 + \exp(-d_i^0)}$

or

(32)  $d_i^0 = \log \frac{N - s_i}{s_i} .$

From (28) we obtain

(33)  $b_r^0 = \log \frac{r}{L - r} .$

It is clear from any of the estimation equations that zero or perfect scores for
either persons or items can not be used to estimate parameters. In (32) and (33), this
would lead to either zero or infinity, for which the log function is not defined. In
(29) and (30), the process could not converge unless all $P_{ri}$ were zero or one, which
can not happen unless the abilities or difficulties are plus or minus infinity.

In light of this, the first step in the estimation process must be the elimination of
zero and perfect scores. This process may require more than one cycle since the
elimination of an item which everyone answered correctly necessitates the elimination
of all persons who answered only that one item correctly, and so forth.

A second problem is that the unconditional maximum likelihood estimates are
biased (Andersen, 1973b). For the case of a two item test it can be shown that the
difficulties are biased by a factor of two. Wright and Douglas (1975b), based on
earlier work by Wright (1966), demonstrate that for tests of any length L for which
$\sum_i \delta_i = 0$ the average bias is L/(L-1) and that correcting all difficulties by this factor
results in estimates that are virtually indistinguishable from those given by the more
expensive but unbiased conditional estimation procedure.

The corrected unconditional estimation algorithm employed by most Rasch analysis
programs (e.g., Wright and Mead, 1975) is

i)  Obtain item scores, $(s_i)$, and counts of the number of
persons at each score, $(n_r)$.

ii)  Edit these data vectors to remove zero and perfect scores
(i.e., $s_i$ = 0 or N and r = 0 or L), cycling as many times as necessary.

iii)  Define an initial set of $(b_r)$ as

     $b_r^0 = \log \frac{r}{L - r}, \quad r = 1, L-1$

iv)  Define an initial set of $(d_i)$ as

     $d_i^0 = \log \frac{N - s_i}{s_i}, \quad i = 1, L$

Center the item set at zero by subtracting $d_\cdot = \sum_i d_i / L$ from each $d_i$.

v)  Obtain a revised set of $(d_i)$ by the one dimensional Newton-Raphson
algorithm until convergence is achieved.

vi)  Using the tentative set of $(d_i)$ as obtained from (v) above, obtain
a revised set of $(b_r)$ once again by Newton-Raphson.

vii)  Repeat steps (v) and (vi) as often as necessary to obtain
stable values for the $(d_i)$.

viii)  Correct for bias by multiplying each $d_i$ by (L-1)/L.

ix)  Calculate the approximate $(b_r)$ for these unbiased $(d_i)$.
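The nine steps above can be sketched in Python as follows. This is a minimal illustration, not the code of any of the programs cited: the function names (`edit`, `prob`, `newton`, `ucon`), the tolerances, the re-centering of the difficulties inside the loop, and the small data matrix are all assumptions made for the example.

```python
import math

def edit(data):
    """Step ii: drop zero/perfect persons and items, cycling until stable."""
    rows = [list(r) for r in data]
    while rows:
        L = len(rows[0])
        kept = [r for r in rows if 0 < sum(r) < L]
        cols = [i for i in range(L) if 0 < sum(r[i] for r in kept) < len(kept)]
        new = [[r[i] for i in cols] for r in kept]
        if new == rows:
            break
        rows = new
    return rows

def prob(b, d):
    """Probability of success for ability b on an item of difficulty d."""
    return 1.0 / (1.0 + math.exp(d - b))

def newton(x, observed, others, weights, sign, tol=1e-8, max_iter=50):
    """One-dimensional Newton-Raphson, as in (29) (sign=-1) and (30) (sign=+1)."""
    for _ in range(max_iter):
        P = [prob(x, o) if sign > 0 else prob(o, x) for o in others]
        resid = observed - sum(w * p for w, p in zip(weights, P))
        info = sum(w * p * (1 - p) for w, p in zip(weights, P))
        step = sign * resid / info
        x += step
        if abs(step) < tol:
            break
    return x

def ucon(data, tol=1e-8, max_cycles=200):
    rows = edit(data)
    N, L = len(rows), len(rows[0])
    s = [sum(r[i] for r in rows) for i in range(L)]              # step i
    scores = list(range(1, L))
    n = [sum(1 for r in rows if sum(r) == k) for k in scores]
    b = [math.log(k / (L - k)) for k in scores]                  # step iii
    d = [math.log((N - si) / si) for si in s]                    # step iv
    mean = sum(d) / L
    d = [di - mean for di in d]
    for _ in range(max_cycles):
        old = d[:]
        d = [newton(di, si, b, n, -1, tol) for di, si in zip(d, s)]      # step v
        mean = sum(d) / L
        d = [di - mean for di in d]   # assumption: re-center on every cycle
        b = [newton(bk, k, d, [1] * L, +1, tol) for bk, k in zip(b, scores)]  # step vi
        if max(abs(u - v) for u, v in zip(d, old)) < tol:                # step vii
            break
    d = [di * (L - 1) / L for di in d]                                   # step viii
    b = [newton(bk, k, d, [1] * L, +1, tol) for bk, k in zip(b, scores)] # step ix
    return d, dict(zip(scores, b))

# Hypothetical responses: 8 persons by 4 items, including one perfect
# and one zero score that the editing step removes.
data = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 0],
        [0, 1, 0, 1], [1, 1, 0, 1], [1, 1, 1, 1], [0, 0, 0, 0]]
d, b = ucon(data)
```

Because the estimating equations involve only the item scores and the score-group counts, items with equal scores receive equal difficulty estimates, and the work per cycle does not grow with the number of persons.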


Cohen's normal approximation: As a final alternative to the problem of estimating item
difficulty parameters, Wright and Douglas (1975b) present the details of a very
inexpensive procedure that was suggested by Cohen in 1973. This procedure assumes
that person abilities are given by an explicit function of total score, and that the
function is completely determined except for a single multiplying parameter which can
be obtained by maximum likelihood. This implies that the distributions of both person
abilities and item difficulties are adequately characterized by the first two moments.

If this is true, the resulting estimates are identical to those obtained by the more
expensive procedures just discussed.

The procedure is as follows:

i)  Define the initial values of difficulties and abilities and their variances
in the sample:

     $d_i^0 = \log \frac{N - s_i}{s_i} - d_\cdot$  where  $d_\cdot = \sum_i \log \frac{N - s_i}{s_i} / L, \quad i = 1, L$

     $D = \sum_i (d_i^0)^2 / [(L-1)(2.89)]$

     $b_r^0 = \log \frac{r}{L - r}, \quad r = 1, L-1$

     $B = \sum_r n_r (b_r^0 - b_\cdot)^2 / [(N-1)(2.89)]$  where  $b_\cdot = \sum_r n_r b_r^0 / N$

ii)  Compute the expansion coefficients:

     $X = [(1 + B)/(1 - BD)]^{1/2}$

     $Y = [(1 + D)/(1 - BD)]^{1/2}$

iii)  Compute the final estimates of the parameters and their standard errors:

(34)  $d_i = X d_i^0$

      $SE(d_i) = X [N / s_i(N - s_i)]^{1/2}$

      $b_r = Y b_r^0$

(35)  $SE(b_r) = Y [L / r(L - r)]^{1/2} .$

Although there is only modest experience with this form of the algorithm, evidence
indicates that for moderately long instruments and more or less symmetrical, unimodal
score distributions, it yields estimates well within a standard error of the values obtained
from the more expensive methods.
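The three steps of the normal approximation translate almost directly into code. The sketch below assumes that zero and perfect scores have already been removed; the function name `cohen_approximation`, its argument layout and the example counts are hypothetical.

```python
import math

def cohen_approximation(s, n, N, L):
    """Normal approximation estimates from item scores and score counts.

    s: list of item scores s_i (each strictly between 0 and N)
    n: dict mapping raw score r (1..L-1) to the number of persons n_r
    """
    # Step i: centered initial difficulties and their scaled variance D.
    d0 = [math.log((N - si) / si) for si in s]
    d_mean = sum(d0) / L
    d0 = [x - d_mean for x in d0]
    D = sum(x * x for x in d0) / ((L - 1) * 2.89)
    # Initial abilities and their scaled variance B.
    b0 = {r: math.log(r / (L - r)) for r in n}
    b_mean = sum(n[r] * b0[r] for r in n) / N
    B = sum(n[r] * (b0[r] - b_mean) ** 2 for r in n) / ((N - 1) * 2.89)
    # Step ii: expansion coefficients.
    X = math.sqrt((1 + B) / (1 - B * D))
    Y = math.sqrt((1 + D) / (1 - B * D))
    # Step iii: final estimates (34), (35) and their standard errors.
    d = [X * x for x in d0]
    se_d = [X * math.sqrt(N / (si * (N - si))) for si in s]
    b = {r: Y * b0[r] for r in n}
    se_b = {r: Y * math.sqrt(L / (r * (L - r))) for r in n}
    return d, se_d, b, se_b

# Hypothetical summary data: 6 persons, 4 items.
s = [5, 4, 2, 2]
n = {1: 1, 2: 3, 3: 2}
d, se_d, b, se_b = cohen_approximation(s, n, N=6, L=4)
```

No iteration is involved, which is the source of the procedure's low cost; the sketch would fail if the product BD reached one, a case that lies well outside the moderate dispersions for which the approximation is intended.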


Binomial  Extension  of  the  Simple  Logistic  Model 

Not all data are scored dichotomously. However, the ideas and equations of the
preceding sections can be extended to more complex cases. Consider a situation in which
a subject v receives a score of 0, 1, ..., $m_i$ on an item i. This might be a score on an
attitude scale, an aptitude test, or target shooting. If this score is taken to be generated
as the result of $m_i$ independent Bernoulli trials, each with probability of success $P_{vi}$,
then the binomial response model

(36)  $P\{X_{vi} \mid P_{vi}, m_i\} = \binom{m_i}{X_{vi}} P_{vi}^{X_{vi}} (1 - P_{vi})^{m_i - X_{vi}}$

describes it (Andrich, 1975). In a given situation we may not be certain that this
model (or the specialization we propose) is appropriate, but we can test the fit of the
model once the parameters have been estimated.

It is useful to write this model in odds notation by letting

(37)  $P_{vi} = \frac{\lambda_{vi}}{1 + \lambda_{vi}}$

where $\lambda_{vi}$ is interpreted as the odds of success. Then

(38)  $P\{X_{vi} \mid \lambda_{vi}, m_i\} = \binom{m_i}{X_{vi}} \frac{\lambda_{vi}^{X_{vi}}}{(1 + \lambda_{vi})^{m_i}} .$

By analogy to Rasch's simple logistic model it seems likely that it will be useful to
write

(39)  $\lambda_{vi} = \xi_v \varepsilon_i .$

That is, each $\lambda_{vi}$ will be taken to be the product of a person parameter $\xi_v$ and an item
parameter $\varepsilon_i$. With this assumption we have

(40)  $P\{X_{vi} \mid \xi_v, \varepsilon_i, m_i\} = \binom{m_i}{X_{vi}} \frac{(\xi_v \varepsilon_i)^{X_{vi}}}{(1 + \xi_v \varepsilon_i)^{m_i}} .$

Note that if $m_i = 1$, then $X_{vi}$ is zero or one and expression (40) reduces to the
Bernoulli form of the preceding section.
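Expression (40) can be checked numerically with a short Python sketch. The function name `binomial_rasch_pmf` and the parameter values are illustrative assumptions; the odds are written in logit form, exp(beta - delta), which corresponds to the product of the person and item parameters.

```python
import math

def binomial_rasch_pmf(x, m, beta, delta):
    """Expression (40): probability of x successes in m trials."""
    odds = math.exp(beta - delta)            # xi_v * epsilon_i
    return math.comb(m, x) * odds ** x / (1 + odds) ** m

# The probabilities over x = 0..m sum to one.
total = sum(binomial_rasch_pmf(x, 4, 0.7, -0.2) for x in range(5))

# With m = 1 the model is the simple logistic (Bernoulli) form.
p_success = binomial_rasch_pmf(1, 1, 0.7, -0.2)
logistic = 1.0 / (1.0 + math.exp(-(0.7 + 0.2)))
```

The second check makes the reduction to the Bernoulli form explicit: with a single trial the binomial coefficient is one and the expression becomes the familiar logistic probability.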


Conditional  Estimation 

Let us consider the possibility of estimating the parameters $\xi_v$ and $\varepsilon_i$. The model
(40) implies as usual the inevitable assumption of conditional independence of
responses over persons and over items, given the parameters of the model. Suppose,
then, that a person responds to L items. By our assumption of conditional independence,
the probability that his responses will be $X_{v1}, \ldots, X_{vL}$ (which we shall write as
$(X_{vi})$), given the parameters, is

(41)  $P\{(X_{vi}) \mid \xi_v, (\varepsilon_i), (m_i)\} = \frac{\prod_i \binom{m_i}{X_{vi}} (\xi_v \varepsilon_i)^{X_{vi}}}{\prod_i (1 + \xi_v \varepsilon_i)^{m_i}}$

where $(\varepsilon_i)$ and $(m_i)$ represent $\varepsilon_1, \ldots, \varepsilon_L$ and $m_1, \ldots, m_L$ respectively. If we
now denote the total score for any person as

(42)  $r_v = X_{v+} = \sum_i X_{vi}$

then

     $P\{r_v \mid \xi_v, (\varepsilon_i), (m_i)\} = \sum \frac{\prod_i \binom{m_i}{x_{vi}} (\xi_v \varepsilon_i)^{x_{vi}}}{\prod_i (1 + \xi_v \varepsilon_i)^{m_i}}$

which can be rewritten as

(43)  $P\{r_v \mid \xi_v, (\varepsilon_i), (m_i)\} = \frac{\xi_v^{r_v} \sum \prod_i \binom{m_i}{x_{vi}} \varepsilon_i^{x_{vi}}}{\prod_i (1 + \xi_v \varepsilon_i)^{m_i}}$

where the sum is taken over all collections of responses $(x_{v1}, \ldots, x_{vL})$ such that

     $x_{v1} + \ldots + x_{vL} = r_v .$

The conditional probability of a particular set of responses $(X_{vi})$ can be
found by dividing (41) by (43) and observing that the probability is now
independent of $\xi_v$:

(44)  $P\{(X_{vi}) \mid r_v, (\varepsilon_i), (m_i)\} = \frac{\prod_i \binom{m_i}{X_{vi}} \varepsilon_i^{X_{vi}}}{\sum \prod_i \binom{m_i}{x_{vi}} \varepsilon_i^{x_{vi}}}$

where the summation in the denominator is over all response vectors with total score $r_v$.

Clearly $r_v$ is a sufficient statistic for $\xi_v$ and $s_i = X_{+i}$ is a sufficient statistic for $\varepsilon_i$, so all
the information about a person's ability or an item's difficulty is contained in the
appropriate total score. Furthermore, given a group of persons it is now possible in
principle to compute the conditional likelihood of their responses and to estimate the item
difficulty parameters by conditional maximum likelihood estimation independently of the
abilities. Similarly, abilities could be estimated independently of item difficulties.

(Footnote: The notation is somewhat less complex if we define $\xi_v = \exp(\beta_v)$ and
$\varepsilon_i = \exp(-\delta_i)$.)

Details of the conditional maximum likelihood estimation procedure for the simple
logistic case (all $m_i = 1$) can be found in Wright and Douglas (1975b). Unfortunately,
the conditional maximum likelihood estimation is quite sensitive to round-off errors;
even an improved estimation procedure which Wright and Douglas devised failed for
moderate numbers of items. There is no reason to believe that conditional estimation
would be more practicable in the binomial case.
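For a small test, the denominator of the conditional probability can be computed exactly, which makes the independence from the person parameter easy to verify. The sketch below treats each item as a polynomial in a dummy variable and multiplies the polynomials together, so that the coefficient of degree r is the required sum over response vectors with total r. This is one convenient way to organize the sum, assumed here for illustration, and the function names and parameter values are hypothetical.

```python
import math
from itertools import product

def binomial_gamma(eps, m):
    """g[r] = sum over response vectors with total r of prod C(m_i, x_i) * eps_i**x_i."""
    g = [1.0]
    for e, mi in zip(eps, m):
        item = [math.comb(mi, x) * e ** x for x in range(mi + 1)]
        new = [0.0] * (len(g) + mi)
        for a, ga in enumerate(g):          # polynomial multiplication
            for x, cx in enumerate(item):
                new[a + x] += ga * cx
        g = new
    return g

def binomial_conditional_probability(x, eps, m):
    """Probability of response vector x given its total score."""
    g = binomial_gamma(eps, m)
    num = math.prod(math.comb(mi, xi) * e ** xi for xi, e, mi in zip(x, eps, m))
    return num / g[sum(x)]

# Hypothetical item easiness parameters and numbers of trials.
eps = [0.5, 1.0, 2.0]
m = [2, 3, 1]
vectors = product(range(3), range(4), range(2))
total = sum(binomial_conditional_probability(x, eps, m)
            for x in vectors if sum(x) == 3)
```

With many items or large numbers of trials the coefficients span many orders of magnitude, which is consistent with the sensitivity to round-off error noted in the text.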


Unconditional  Estimation 


Even if the conditional estimation procedure could be made to work, its excessive
cost would probably inhibit wide application. Recognizing the cost and instability of
conditional estimation, Wright and Panchapakesan (1969) proposed a method of joint
parameter estimation for the simple logistic model. This estimation procedure has been
extended to the binomial case.

Let $((X_{vi}))$ be the matrix of responses of persons 1, ..., N to items 1, ..., L,
that is,

(45)  $((X_{vi})) = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1L} \\ X_{21} & X_{22} & \cdots & X_{2L} \\ \vdots & \vdots & & \vdots \\ X_{N1} & X_{N2} & \cdots & X_{NL} \end{bmatrix}$

By conditional independence we have the joint probability

(46)  $\Lambda = P\{((X_{vi})) \mid (\xi_v), (\varepsilon_i)\} = \frac{\prod_v^N \prod_i^L \binom{m_i}{X_{vi}} (\xi_v \varepsilon_i)^{X_{vi}}}{\prod_v^N \prod_i^L (1 + \xi_v \varepsilon_i)^{m_i}} ,$

so

     $\lambda = \log \Lambda = \sum_v \sum_i \log \binom{m_i}{X_{vi}} + \sum_v r_v \log \xi_v + \sum_i s_i \log \varepsilon_i - \sum_i \sum_v m_i \log(1 + \xi_v \varepsilon_i) .$

Writing $\beta_v = \log \xi_v$, $\delta_i = -\log \varepsilon_i$ as in the simple logistic case gives

(47)  $\lambda = \sum_v \sum_i \log \binom{m_i}{X_{vi}} + \sum_v r_v \beta_v - \sum_i s_i \delta_i - \sum_v \sum_i m_i \log(1 + \exp(\beta_v - \delta_i)) .$

Thus

(48)  $\frac{\partial \lambda}{\partial \beta_v} = r_v - \sum_i m_i P_{vi}, \quad v = 1, \ldots, N,$

(49)  $\frac{\partial^2 \lambda}{\partial \beta_v^2} = -\sum_i m_i P_{vi}(1 - P_{vi}), \quad v = 1, \ldots, N,$

(50)  $\frac{\partial \lambda}{\partial \delta_i} = -s_i + \sum_v m_i P_{vi}, \quad i = 1, \ldots, L,$

and

(51)  $\frac{\partial^2 \lambda}{\partial \delta_i^2} = -\sum_v m_i P_{vi}(1 - P_{vi}), \quad i = 1, \ldots, L.$


Recall that all subjects with score r will have the same estimated ability $b_r$, so
equations (48) lead to the estimation equations

(52)  $r - \sum_i m_i P_{ri} = 0, \quad r = 1, M-1$

where $d_i = \log[(N - s_i)/s_i]$. Observe that (52) has no solutions for zero and perfect
scores, so they must be eliminated from the data. Similarly, (50) gives the estimation
equations

(53)  $s_i - \sum_r n_r m_i P_{ri} = 0, \quad i = 1, \ldots, L$

where $n_r$ is the number of subjects with score r.

Our experience with the simple logistic model leads us to expect a dependency in
these equations, and, indeed, summing (52) over r and (53) over i gives identical sums.
We resolve this dependency by setting $\sum_i m_i d_i = 0$. Other constraints might be used,
but this one helps keep down rounding error during estimation, and any other linear
constraint can be implemented by transforming the parameter estimates obtained using
this one.

It is now a simple matter to estimate the $\beta_v$ and $\delta_i$ by the Newton-Raphson method.
Details of the estimation process for the simple logistic case can be found in Wright and
Douglas (1975b) or Wright and Mead (1975). Their procedure generalizes directly to
the binomial case.

Andersen (1973a) has shown that these estimates are biased. However, Wright and
Douglas (1975b) show by simulation that most of the bias can be cleared up by
multiplying the $d_i$ by (L-1)/L when all $m_i = 1$. Further simulation indicates that
(M-1)/M is a suitable unbiasing constant for the binomial case.

Standard  Errors 

In principle, asymptotic estimates of the standard errors of the parameter estimates
are given by

     $SE(\theta) = [-\mathrm{diag}\{(\partial^2 \lambda / \partial \theta^2)^{-1}\}]^{1/2} .$

Here the matrix of second derivatives is nearly diagonal, so we take

(55)  $SE(b_r) = [\sum_i m_i P_{ri}(1 - P_{ri})]^{-1/2} .$


Tests  of  Fit 

A primary benefit from having an explicit mathematical model for a process is the
possibility of making rigorous tests of how well the observed data are predicted by the
model. In the case of the Rasch model, the most detailed form of the data is an N x L
matrix, denoted by $((X_{vi}))$, consisting of one row for each person and one column for
each item. The entry $X_{vi}$ is the score of person v on item i. It has a range of 0 to $m_i$.
For the most familiar Bernoulli form of the model, all $m_i$ are equal to one.

The expected value of $X_{vi}$ is

     $E(X_{vi}) = m_i P_{vi}$

and its variance is

(56)  $V(X_{vi}) = m_i P_{vi}(1 - P_{vi}) .$

Therefore, the difference between the observed score for the person and the predicted
score

(57)  $X_{vi} - m_i P_{vi}$

may be standardized by dividing by the estimated standard deviation:

(58)  $z_{vi} = \frac{X_{vi} - m_i P_{vi}}{[m_i P_{vi}(1 - P_{vi})]^{1/2}} .$

The sample-free property of the model suggests one strategy for organizing the
residuals. Since the estimates of item difficulty should, under the model, be independent
of the distribution of person ability, the difficulty estimator should be equally appropriate
for all scores. In other words, we should obtain the same estimated difficulty when just
the low scores are used as when the high scores are used. If we were to adjust the
estimates to fit score r exactly, the first adjustment for item i would be proportional to
(compare to expression (29))

(59)  $x_{ri} - n_r m_i P_{ri}$

where $x_{ri}$ is the total score on item i of the $n_r$ persons with score r. If we standardize
by dividing by the standard deviation and square,

(60)  $V_{ri}^2 = K \frac{(x_{ri} - n_r m_i P_{ri})^2}{n_r m_i P_{ri}(1 - P_{ri})} ,$

we obtain a chi-square statistic with one degree of freedom. The multiplier K is a
correction factor, usually near one, to inflate the statistic to the equivalent of one
degree of freedom (Haberman, 1973). If all the $n_r$ are equal and $P_{ri}(1 - P_{ri})$ is nearly
constant for all r and i, then K can be shown to be:

(61)  $K = \frac{L(M-1)}{(L-1)(M-2)} .$

The intuitive motivation for this can be grasped easily by noting that since i goes from
1 to L and r from 1 to M-1, there are L(M-1) statistics $V_{ri}^2$. But, having fit L-1 item
parameters and M-1 person parameters, there are only (L-1)(M-2) degrees of freedom available.

Since it frequently happens that some scores are not observed in a particular
sample, or are very rare, the summation may also be done over score groups containing
more than one score:

(62)

Collecting over groups, $V_{\cdot i}^2 = \sum_g V_{gi}^2$ gives a chi-square statistic with g degrees of
freedom.
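The expected value, variance and standardized residual of a single observation, as described for expressions (56) through (58), can be sketched as follows. The function name `standardized_residual` and the example values are illustrative assumptions; in practice the probability would be computed from the estimated ability and difficulty.

```python
import math

def standardized_residual(x, m, p):
    """Standardized residual for an observed score x out of m trials.

    p is the modeled probability of success on one trial, for example
    exp(b - d) / (1 + exp(b - d)) from estimated ability b and difficulty d.
    """
    expected = m * p                  # expected score
    variance = m * p * (1 - p)        # binomial variance, as in (56)
    return (x - expected) / math.sqrt(variance)

# Hypothetical observation: 3 successes in 4 trials when p = 0.5.
# Expected score 2, standard deviation 1, so the residual is 1.
z = standardized_residual(3, 4, 0.5)

# Bernoulli case (m = 1): an unexpected success on a hard item
# produces a large positive residual.
z_surprise = standardized_residual(1, 1, 0.1)
```

Squared residuals of this kind are the building blocks of the chi-square statistics discussed in this section, whether they are accumulated within score groups or over all persons.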

2 

While  V^.  specifically  asks  the  question  would  all  score  groups  give  the  same 
estimate  of  difficulty  for  item  !,  it  is  possible  to  compute  a  more  general  statistic  from 
expression  (58).  Squaring  and  summing  over  ail  persons  gives  a  test  statistic  for  the 
fit  of  item  i: 


«  -  5  zvi  (jEW-ijj 

2 

which  is  approximately  distributed  as  a  chi-square  with  N  degrees  of  freedom  Vj  will 

tend  to  be  large  when  different  groups  give  different  estimates  of  abilities  (as 

will  Vj)  and  when  persons  in  the  same  score  group  obtain  their  score  in  different  ways. 

A frequently mentioned alternative to the Rasch model is the logistic model containing
a second item parameter, item discrimination. While this model lacks the
essential measurement properties of the Rasch model, it can help conceptualize misfitting
data if an index of discrimination is computed. Such an index can be derived
as follows. A general expression for the probability of success on one trial is

(64)  $P_{vi} = \dfrac{\exp(\lambda_{vi})}{1 + \exp(\lambda_{vi})}$ .

One possible parameterization of $\lambda_{vi}$ is that employed by the Rasch model; i.e., $\lambda_{vi} = \beta_v - \delta_i$.
An alternative generator might include a discrimination parameter. Then
the probability would be


(65) 
If this is the actual generator, then (64) and (65) are equal and the logits (the
exponents in this application) are also equal, statistically; hence,


(66) 


where the residual error $\epsilon_{vi}$ is included because the linear model cannot account for
all the variation in $\lambda_{vi}$. Since expression (64) provides a unique parameter for each
person-item combination, $\lambda_{vi}$ is the same as the observed logit, which in most applications
would be estimated by:


(67)
However, in our case this cannot be done when $x_{vi}$ is either zero or $m_i$.
To escape this, let us rewrite (66) in terms of a residual from the Rasch model:

(68)
The logistic residual $\Delta_{vi}$ can be approximated from (57) by recalling that the rate of
exchange between score units and logits is approximately equal to the derivative of
$P_{vi}$ with respect to $(\beta_v - \delta_i)$. This derivative is

(69)  $\partial P_{vi} / \partial(\beta_v - \delta_i) = P_{vi}(1 - P_{vi})$

and therefore,

(70)

Rewriting (68) in terms of statistics which we can compute, we have

(71)  $Y_{vi} = a_i(b_v - d_i) + e_{vi}$ ,

where $a_i = (\alpha_i - 1)$. Since, with respect to item i, the difficulty $d_i$ is constant, an
index of the item's discriminating power can be computed by regressing $Y_{vi}$ on ability.
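
The steps between (64) and (71) can be sketched as follows. This is a reconstruction under stated assumptions rather than a quotation: the two-parameter generator is taken to have the usual logistic form with slope $\alpha_i$, the score residual of (57) is taken to be $x_{vi} - m_i P_{vi}$, and the symbol $\Delta_{vi}$ for the logistic residual is a label chosen here.

```latex
\begin{align*}
P_{vi} &= \frac{\exp[\alpha_i(\beta_v-\delta_i)]}{1+\exp[\alpha_i(\beta_v-\delta_i)]}
  && \text{(assumed two-parameter generator)} \\
\lambda_{vi} &= \alpha_i(\beta_v-\delta_i) + \epsilon_{vi}
  && \text{(equating logits)} \\
\Delta_{vi} &= \lambda_{vi} - (\beta_v-\delta_i)
   = (\alpha_i - 1)(\beta_v-\delta_i) + \epsilon_{vi}
  && \text{(residual from the Rasch model)} \\
\frac{\partial P_{vi}}{\partial(\beta_v-\delta_i)} &= P_{vi}(1-P_{vi})
  && \text{(Rasch case, } \alpha_i = 1\text{)} \\
\Delta_{vi} &\approx \frac{x_{vi} - m_i P_{vi}}{m_i P_{vi}(1-P_{vi})}
  && \text{(score residual converted to logits)}
\end{align*}
```

Replacing the parameters by their estimates $b_v$, $d_i$ and the residual by its computable approximation $Y_{vi}$ gives the regression form (71).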
Therefore, 


(72)  $a_i = \dfrac{\sum_v (b_v - b_\cdot)\, Y_{vi}}{\sum_r n_r (b_r - b_\cdot)^2}$

where $b_\cdot = \sum_r n_r b_r / N$ ,

and the associated sum of squares is

(73)


All the test of fit statistics presented in this section have the appearance of
chi-square (or mean square) variates, but recent simulation studies (Mead, 1976)
show that this distribution is not exactly correct. Hence, exact probability statements
about lack of fit are not possible. The chi-square distribution is a useful
background against which to judge these statistics, however.
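
The regression that yields the discrimination index can be sketched numerically. The example assumes that $Y_{vi}$ is the score residual divided by $m_i P_{vi}(1 - P_{vi})$ and that the reported index is $1 + a_i$; both are assumptions made for illustration, and the data are hypothetical.

```python
import math


def logistic_residual(x, trials, ability, difficulty):
    """Approximate residual in logits: score residual over its rate of exchange."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return (x - trials * p) / (trials * p * (1.0 - p))


def discrimination_index(hits, abilities, difficulty, trials):
    """Least-squares slope of Y on ability, returned as 1 + a (assumed scale)."""
    ys = [logistic_residual(x, trials, b, difficulty)
          for x, b in zip(hits, abilities)]
    mean_b = sum(abilities) / len(abilities)
    sxy = sum((b - mean_b) * y for b, y in zip(abilities, ys))
    sxx = sum((b - mean_b) ** 2 for b in abilities)
    return 1.0 + sxy / sxx


# Hypothetical persons whose hits equal the Rasch expectation, so every
# residual is (numerically) zero and the index is close to one.
abilities = [-1.0, 0.0, 1.0]
expected_hits = [10 / (1.0 + math.exp(-b)) for b in abilities]
index = discrimination_index(expected_hits, abilities, difficulty=0.0, trials=10)
print(index)
```

Data in which able persons hit more often than expected and less able persons hit less often than expected give an index above one, that is, a steeper observed curve.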


CHAPTER  III 


DESCRIPTION  OF  THE  BICAL  PROGRAM:  ANALYSIS  OF 
MILITARY POLICE PISTOL DATA

BICAL  is  a  FORTRAN  program  designed  to  estimate  and  test  the  parameters  in  the 
Rasch  model  when  written  as: 


(74) 
The observation $x_{vi}$ represents the number of successes by person v in $m_i$ trials at task i.
The capacities of the program are listed in Table 1. The number of subjects permitted
is restricted only by the availability of auxiliary storage. A description of the required
control cards is given in Appendix C. The military police data will be used to illustrate
the program's application.
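
For reference, a binomial-trials form of the Rasch model consistent with this description can be written as below. It is a standard form assumed here for illustration; $\beta_v$ and $\delta_i$ denote person ability and task difficulty.

```latex
\[
P(x_{vi} \mid \beta_v, \delta_i, m_i)
  = \binom{m_i}{x_{vi}}
    \frac{\exp[x_{vi}(\beta_v - \delta_i)]}{[1 + \exp(\beta_v - \delta_i)]^{m_i}},
  \qquad x_{vi} = 0, 1, \ldots, m_i .
\]
```

Setting $m_i = 1$ reduces this to the familiar dichotomous model.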

The  pistol  data  was  collected  to  assess  the  competence  of  MP  candidates.  It 
involved eight target presentations which differed in the distance from the marksman
and  the  position  from  which  he  was  required  to  fire.  For  the  first  two  targets,  ten  shots 
were  required;  for  the  remaining  six,  only  five  shots.  The  description  of  the  task  and 
number  of  shots  at  each  are  summarized  in  Table  2.  Table  3  shows  the  control  cards  used 
for  this  analysis. 
TABLE 1

BICAL PROGRAM CAPABILITIES

Description               Symbol                 Maximum Value

Number of items           L                      150
Trials on one task        $m_i$                  35
Total number of trials    $M = \sum_i m_i$       1000


TABLE 2

DESCRIPTION OF THE MILITARY POLICE PISTOL DATA

Task      Meters to   Position of     Task    Number of
Number    Target      Marksman        Name    Shots

1         35          Prone           35P     10
2         25          Kneel           25N     10
3         25          Strong Left     25SL    5
4         25          Strong Right    25SR    5
5         15          Kneel           15N     5
6         15          Strong Left     15SL    5
7         15          Strong Right    15SR    5
8         7           Crouch          7C      5
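
The task layout of Table 2 can be held in a small data structure, which also makes the maximum possible score explicit. The class and variable names below are illustrative only.

```python
from typing import NamedTuple


class Task(NamedTuple):
    number: int
    meters: int
    position: str
    name: str
    shots: int


# Rows transcribed from Table 2.
TASKS = [
    Task(1, 35, "Prone", "35P", 10),
    Task(2, 25, "Kneel", "25N", 10),
    Task(3, 25, "Strong Left", "25SL", 5),
    Task(4, 25, "Strong Right", "25SR", 5),
    Task(5, 15, "Kneel", "15N", 5),
    Task(6, 15, "Strong Left", "15SL", 5),
    Task(7, 15, "Strong Right", "15SR", 5),
    Task(8, 7, "Crouch", "7C", 5),
]

# Total number of trials M: the largest score a marksman can obtain.
total_shots = sum(task.shots for task in TASKS)
print(total_shots)  # 50
```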
TABLE 3

SAMPLE CONTROL CARDS FOR RUNNING BICAL

Card     Card                 Card      Sample from MP data
Number   Name                 Format

1        Title Card           (20A4)    Military Police Data-Hits Per Target
2        Input Description    (14I5)    8 25 5 45 12 2 1
3        Item Names           (20A4)    35P 25N 25SL25SR15N 15SL15SR7C
4        Column Select        (80A1)    AA555555
5        Key                  (80A1)    77333333
6        Options Labels       (5A1)     12345
7        (Data Cards)
7a       End of Data          (A1)      *
10       End of Job           (A4)      ****

Card 1 is a title card which supplies identifying information to be printed at the
top of each page of output.

Card  2,  the  data  description  card,  describes  how  the  data  is  to  be  presented  and 
handled.  The  data  for  this  run  are  described  as  follows. 

First,  there  are  eight  tasks,  i.e.,  each  person  has  eight  scores  to  be  read, 
as  described  in  Table  2.  Second,  the  groups  used  in  tests  of  fit  must  average  at  least 
25  persons  each.  This  determines  the  number  of  groups  that  will  be  used.  The  program 
format  is  limited  to  six  groups,  but  if  the  total  number  of  persons  divided  by  six  is  less 
than  twenty-five  (as  in  this  case)  fewer  groups  will  be  used.  This  value  will  also  halt 
the  estimation  of  parameters  if,  after  editing,  there  are  fewer  than  twenty-five  persons 
remaining  in  the  sample.  If  no  value  is  provided  the  default  value  is  thirty. 
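
The rule for the number of score groups can be sketched as follows. The function is a hypothetical illustration of the rule as described here, not BICAL's own routine.

```python
def number_of_score_groups(n_persons: int, min_average: int = 30,
                           max_groups: int = 6) -> int:
    """Largest number of groups, up to six, whose average size is at least
    `min_average`. Raises if too few persons remain after editing."""
    if n_persons < min_average:
        raise ValueError("too few persons remain after editing")
    return min(max_groups, n_persons // min_average)


# 126 persons remain in the calibration sample of this run.
print(number_of_score_groups(126, min_average=25))  # 5
```

Six groups of 126 persons would average only 21 each, so five groups are used.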

The third and fourth values define the range of scores to be included in the calibration
sample. Only persons scoring at least five but not more than forty-five will be included.
This is done because extremely high or extremely low scorers frequently behave
abnormally. The scores to be excluded need to be thought through for each application,
for their choice depends on the way extreme scores might occur. In achievement
testing, it is usually desirable to set the lower limit somewhat above the chance level.

Fifth, the value of "12" indicates that only the first twelve columns of each record
need  be  read.  Since  in  this  case  the  data  is  punched  in  columns  5  through  12,  there  is 
no  need  to  read  beyond  12. 

Sixth,  the  "2"  selects  the  second  available  calibration  technique.  This  is  the 
corrected  unconditional  method,  and  is  chosen  for  this  problem  because  of  the  small 
sample  size  and  the  asymmetrical  distribution  of  scores. 
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Seventh,  the  "1"  specifies  that  the  data  is  already  scored  and  the  value  to  be 
read  for  each  item  is  the  person's  number  of  hits  on  that  target. 

All  remaining  parameters  will  assume  default  values.  This  means  that  data  input 
is  from  cards,  no  output  will  be  produced  except  on  the  printer  and  all  standard  printed 
output  will  be  generated. 
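
Reading the data description card amounts to cutting the record into five-column integer fields. The sketch below uses the field labels of Appendix C; treating a blank field as zero is an assumption about how the card is read, and the function name is illustrative.

```python
FIELD_NAMES = ["NITEM", "NGROP", "MINSC", "MAXSC", "LREC", "KCAB", "KSCOR",
               "INFLE", "LLIM", "KLIM", "NUFLE", "KPRTR", "KSIM", "KDIFF"]


def parse_description_card(card: str) -> dict:
    """Read fourteen right-justified five-column integers (FORTRAN 14I5)."""
    card = card.ljust(5 * len(FIELD_NAMES))
    values = {}
    for k, name in enumerate(FIELD_NAMES):
        field = card[5 * k: 5 * k + 5].strip()
        values[name] = int(field) if field else 0  # blank read as zero (assumed)
    return values


# The sample card of Table 3, punched as seven five-column fields.
card = "".join(f"{v:5d}" for v in (8, 25, 5, 45, 12, 2, 1))
params = parse_description_card(card)
print(params["NITEM"], params["NGROP"], params["KCAB"], params["KSCOR"])
```

Fields left at zero are the ones for which the program supplies its default values.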

Card  3,  the  item  name  card  provides  a  four  character  name  for  each  item  read. 

There  are  eight  such  fields  here  and  they  are  coded  in  the  same  order  as  the  items 
occur  on  the  data  cards. 

Card  4,  the  column  select  card  serves  two  functions.  First,  any  character  other 
than  blank  or  zero  indicates  that  the  column  contains  an  item  included  in  the  item  count 
(8) on the data description card and named on the item name card. An ampersand (&)
causes  an  item  to  be  excluded  from  the  analysis  although  read  and  named.  This 
facilitates  dropping  misfitting  items  with  a  minimum  of  changes  to  the  control  cards. 

The column select card also defines the maximum possible score ($m_i$) for each item. Since
the fields are only one column in width, the alphabetical characters (A-Z) are used to
designate the values (10-35); data cards must be coded in the same way. A value larger
than 35 cannot be accepted by the program as it now stands.

The interpretation of the card given in Table 3 is that no data are wanted from columns
1 to 4 of the input record. Columns five and six contain tasks which have maximum scores
of 10. Columns seven through twelve contain tasks which have maximum scores of five.
This accounts for all eight items.
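
The coding of the column select card can be illustrated with a short decoder. This is a sketch of the rules as described here; rejecting characters outside the documented codes is a choice made for the example, and the function name is hypothetical.

```python
def decode_column_select(card: str):
    """Return (column, max_score, keep) for each selected column.

    Columns are numbered from 1. Blank or '0' skips the column, '1'-'9' and
    'A'-'Z' give maximum scores 1-9 and 10-35, and '&' marks an item that is
    read and named but dropped from the analysis.
    """
    items = []
    for col, ch in enumerate(card, start=1):
        if ch in " 0":
            continue
        if ch == "&":
            items.append((col, None, False))
        elif ch.isdigit():
            items.append((col, int(ch), True))
        elif "A" <= ch <= "Z":
            items.append((col, ord(ch) - ord("A") + 10, True))
        else:
            raise ValueError(f"unexpected code {ch!r} in column {col}")
    return items


# The Table 3 card: columns 1-4 blank, items in columns 5-12.
items = decode_column_select("    AA555555")
print([max_score for _, max_score, _ in items])  # [10, 10, 5, 5, 5, 5, 5, 5]
```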

Card 5, the key card in Table 3, is not referenced in this run since the data is
already scored (col. 35 of the data description card), but a key card must be included
in the deck.

Card  6,  the  options  label  card  defines  up  to  five  possible  data  values.  The  same 
five values apply to all items. The frequency of occurrence of each of the specified
values  is  accumulated  for  each  item.  For  this  example,  a  table  showing  the  number 
of  times  each  target  was  hit  1,  2,  3,  4  or  5  times  will  be  prepared  and  printed. 

Interpretation  of  BICAL  Output  for  MP  Pistol  Data 

The  analysis  offered  here  is  intended  to  illustrate  interpretation  of  the  BICAL  out¬ 
put;  it  is  not  a  definitive  analysis  of  this  particular  data  set.  Page  1  of  the  output  shown 
in  Appendix  B  lists  the  control  cards  just  discussed.  This  enables  the  user  to  check 
quickly  that  the  analysis  performed  is  the  one  intended.  In  addition,  the  first  input 
record  and  the  total  number  of  records  are  shown  to  verify  that  the  records  were  read 
correctly. 

Page  2  is  the  alternative  response  frequency  table  that  was  specified  by  the  Options 
Label  Card  6.  The  "UNKN"  column  is  the  count  of  the  frequency  of  any  character 
other  than  the  five  shown.  Since  targets  1  and  2  could  have  scores  from  zero  to  ten, 
these  tasks  show  a  large  number  of  unknowns.  For  the  others,  the  only  unknowns  are 
the  zero  scores. 

Page 3 reviews the editing process. For this example there were no persons with
zero or fifty hits. If there had been, these persons would have been excluded from all
subsequent analyses. There were eight columns selected by the column select card and
eight item names were provided.

There were no persons scoring below five and ten scoring above forty-five, leaving a
total of 126 to be used in the calibration. This table would not include the zero or perfect
scores noted earlier.

No  items  were  rejected  because  of  perfect  or  zero  item  scores.  Therefore,  the 
analysis  will  be  done  on  eight  tasks  with  126  persons.  Had  items  and  persons  been 
eliminated, the minimum and maximum accepted scores would have been suitably
adjusted.

Pages  4  and  5  are  histograms  for  person  scores  and  item  scores.  For  persons, 
the number at each score (i.e., number of hits) is shown. This is scaled to fill
the  grid  with  the  scale  factor  shown  at  the  bottom.  For  items,  the  figure  shows  the 
proportion  of  success  for  each  item.  For  instance,  there  were  764  successes  in  1260 
trials  on  item  one.  The  general  impression  given  by  these  graphs  is  that  the  tasks 
were  "relatively  easy"  for  the  persons  resulting  in  high  item  scores,  which  increase  as 
target  distance  decreases,  and  that  there  is  a  negatively  skewed  distribution  of  person 
scores. 

Page 6 contains the difficulty estimates and the related standard errors of calibration
for each item. These are the values needed for any future application of the
items. The mean difficulty (weighted by the number of trials for each item) is always
zero. As expected from the histogram, the difficulties decrease as target distance
decreases. The standard errors are smallest for the most difficult tasks because of the high
ability of the sample. These are the tasks with difficulties most like the abilities of
the persons tested and hence for which the most information was obtained.

The table also provides some statistics on the estimation process. At the top the
difficulty and ability "scale factors" indicate the amounts by which the initial log
odds estimates were inflated by the normal approximation method, "PROX".

The body of the table, in addition to the difficulty estimates and their standard
errors, displays the magnitude of the adjustment in the last cycle (an indication of the
rate of convergence), the difficulty estimate that was returned by "PROX" and the
estimate after one cycle in "UCON". These are displayed to provide experience with
how PROX compares to UCON and when the less expensive estimates are good enough.
In this instance, there is little difference in the estimates even though the score
distribution is skewed.

Page  7  gives  the  conversion  of  raw  scores  to  estimated  abilities  and  the  standard 
errors  of  measurement  associated  with  each  score.  The  test  characteristic  curve  is  a 
picture  of  the  range  of  ability  covered  by  these  eight  tasks. 
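
A conversion table of this kind can be sketched by solving, for each raw score, the equation that sets the expected score equal to the observed one. The routine below is a plain Newton-Raphson illustration under that assumption; it ignores any corrections BICAL applies, and the difficulties shown are hypothetical rather than the estimates of this run.

```python
import math


def ability_for_score(score, difficulties, trials, tol=1e-8, max_iter=100):
    """Ability (logits) whose expected score equals `score`, with its
    approximate standard error 1/sqrt(information)."""
    total = sum(trials)
    if not 0 < score < total:
        raise ValueError("zero and perfect scores have no finite estimate")
    b = math.log(score / (total - score))  # start at the log odds of the score
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(b - d))) for d in difficulties]
        expected = sum(m * p for m, p in zip(trials, probs))
        info = sum(m * p * (1.0 - p) for m, p in zip(trials, probs))
        step = (score - expected) / info
        step = max(-1.0, min(1.0, step))  # damp large steps
        b += step
        if abs(step) < tol:
            break
    return b, 1.0 / math.sqrt(info)


TRIALS = [10, 10, 5, 5, 5, 5, 5, 5]                    # shots per task (Table 2)
DIFFS = [0.9, 0.4, 0.3, 0.1, -0.2, -0.3, -0.5, -0.7]   # hypothetical difficulties

for raw in (10, 25, 40):
    b, se = ability_for_score(raw, DIFFS, TRIALS)
    print(raw, round(b, 2), round(se, 2))
```

Plotting expected score against ability over the same range gives the test characteristic curve.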

Pages 8 and 9 display a variety of item fit statistics. Unlike estimates of item
difficulty,  the  tests  of  fit  are  very  much  sample  dependent.  That  an  item  fits  for  one 
sample  does  not  guarantee  it  will  fit  for  another.  Useful  interpretation  of  these  statistics 
requires  both  familiarity  with  them  and  a  thorough  understanding  of  the  tasks  and 
sample  that  generated  them. 

The basic statistic is the overall Fit Mean Square which appears on both pages
(under the heading "total" on page 8). This is simply the mean squared standard
residual $z_{vi}^2$ averaged over persons. It will be large for an item if there are too many
high ability persons who failed on the item and/or too many low ability persons who
succeeded. What is "too large" depends on the requirements of the particular situation.
The expected values and standard errors of these mean squares are 1 and $(2/f)^{1/2}$,
where f is the number of groups. More than three standard errors greater than one
seems to be a reasonable rule of thumb for "too large".
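
The rule of thumb can be written as a one-line calculation. The sketch takes the standard error of a mean square to be the square root of 2/f, which is an assumption about the intended expression; the function name is illustrative.

```python
import math


def misfit_threshold(f: int, n_se: float = 3.0) -> float:
    """Upper limit for a mean square: 1 plus n_se standard errors of sqrt(2/f)."""
    return 1.0 + n_se * math.sqrt(2.0 / f)


# With f = 5, the value used for illustration here.
print(round(misfit_threshold(5), 2))
```

Larger values of f tighten the limit, so the same observed mean square is judged more severely.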

Two  targets  which  have  little  else  in  common  fall  into  this  category.  Target  1 
is  unique  in  several  respects;  it  is  the  first  in  the  sequence,  it  is  at  the  greatest 
distance and is the only one that involves the prone position. All of these could be
contributing  to  the  misfit.  Choosing  among  them  would  require  a  clinical  investigation 
of  the  situation.  This  mean  square  says  only  that  performance  on  this  task  has  the  weakest 
relation  to  performance  on  the  other  seven  tasks.  The  non-significant  between  group 
mean  square  for  this  item  (of  1.90)  indicates  that  statistically  equivalent  estimates  of 
difficulty  would  result  from  using  either  the  low  scorers  or  the  high  scorers  for  calibration. 

Target  6  is  not  interesting  in  its  position  in  the  sequence  and  there  were  other  targets 
at  the  same  range  and  same  position.  This  mean  square  is  an  index  of  the  disagreement 
between  the  variable  as  defined  by  the  item  and  the  variable  as  defined  by  all  items. 

The  fit  for  this  target  which  involved  firing  from  the  left  could  change  if  the  mixture 
of  shots  from  the  right  and  left  were  changed.  This  would  imply  that  an  extraneous 
factor,  handedness,  has  an  influence  on  the  outcome. 

If  we  consider  the  possible  effect  of  handedness  on  the  difficulty  of  these  tasks, 
the  shots  from  the  right  side  would  tend  to  be  easier  for  right-handed  marksmen.  Since 
eighty  to  ninety  per  cent  of  the  sample  would  be  right  handed,  a  shot  from  the  right 
would  appear  easier  than  the  equivalent  shot  from  the  left.  However,  the  effect  is 
reversed  for  a  left-handed  marksman  and  this  person  would  do  poorly  on  the  "easy" 
right  handed  shots  and  well  on  the  "difficult"  left  handed  shots.  While  this  would 
produce misfit over all person-task combinations, "surprising" results would tend to
accumulate on the shots from the left because of the predominance of right-handed
marksmen in the data. It might be eliminated by defining the shots as favored and
not favored rather than left and right.

The between group mean square tests the agreement between the observed item
characteristic curve and the best fitting Rasch characteristic curve as estimated by the
groups selected. Five points on the observed curve are shown for each item on the left
of page 8. The points shown were chosen by the program to represent groups of
increasing ability and approximately equal size such that the average group size is at
least 25.

The  worst  discrepancies  between  the  curves  are  for  targets  4  and  7,  both  of  which 
involve  firing  from  the  right  side.  In  particular,  for  target  4  score  group  two  was 
seventy  per  cent  successful  while  group  three  was  only  62  per  cent  successful.  The 
model  predicted  64  and  73  per  cent  respectively.  The  discrepancy  in  proportion  metric 
is  given  in  the  center  panel  of  page  8.  Complete  understanding  of  the  reasons  for  this 
requires greater knowledge of the effect of handedness on marksmanship but one
hypothesis  is  that  ability  group  three  contained  a  preponderance  of  left-handed  persons 
who  do  poorer  than  expected  on  shots  from  the  right. 

The  remaining  column  on  page  8  is  the  within  group  mean  square.  It  is  the  misfit 
remaining  after  removing  the  effect  of  difference  in  the  shape  of  the  characteristic  curves. 
It  will  be  large  and  the  between  group  effect  small  if  the  correct  proportion  of  the  group 
succeeded  but  the  wrong  people  in  the  group  were  the  ones  who  succeeded.  It  provides 
no information not contained in the between and the total but is a reorganization that
is sometimes convenient.

The discrimination index is also closely related to the between group mean square.
It is in fact the linear trend across score groups. Values larger than one indicate that
the observed characteristic curve for an item is steeper than the average best fitting
logistic curve for all items; values less than one indicate the curve is flatter. In this
example there is no reason to suspect that the targets do not all have equal discriminations.
In data simulated with exactly equal discriminations the standard deviation of the
observed discriminations is frequently as large as 0.20; hence, the value observed
here (0.11, from page 9) is quite acceptable.

Page  10  contains  a  plot  of  the  discrepancies,  standardized  and  squared,  between 
the  observed  and  fitted  characteristic  curves  (center  panel,  page  8)  against  the 
probability  of  success  for  that  group  on  that  item.  In  this  case,  this  plot  does  little 
to  increase  our  understanding.  It  is  useful  with  achievement  tests  where  random 
guessing  is  a  problem.  In  those  situations  large  values  of  the  z-squares  are  found  near 
the  chance  level. 

Pages 11, 12 and 13 are two-way plots of the three statistics given for each item
on  page  9;  difficulty,  discrimination  and  total  fit  mean  square.  There  is  no  new  in¬ 
formation  in  them,  but  examining  the  plots  is  a  convenient  way  to  be  certain  not  to 
miss  any  interesting  results. 
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APPENDIX  C 


BICAL CONTROL CARDS

Position  Name and Format; Description

1.   Title Card  (20A4)
       Descriptive heading to be printed at the top of each page of output.

2.   Data Definition  (14I5)

       CC       Label   Definition

       1-5      NITEM   Total number of items to be read before deletions. This is
                        equal to the number of non-zero entries on the column select
                        card and is the number of item names expected.
       6-10     NGROP   Smallest allowable average group size for testing item fit.
                        This is used to determine the number of score groups. The
                        same value is used to terminate execution before estimation
                        if the total number of subjects is less than NGROP.
       11-15    MINSC   Minimum score to be included in the calibration sample.
       16-20    MAXSC   Maximum score to be included.
       21-25    LREC    Number of columns in the input record to be scanned. It must
                        be large enough to cover all columns containing items and
                        also to skip any extra cards in the subject record.
       26-30    KCAB    Calibration code
                          1 = Normal approximation method. Should be used with long
                              tests and symmetrical distribution of scores.
                          2 = Corrected unconditional maximum likelihood estimation.
                              Should be used with shorter tests and skewed
                              distributions.
       31-35    KSCOR   Scoring code
                          blank, 0 = score dichotomously according to KEY
                          1 = data already scored
                          2 = score dichotomously, correct if X ≤ KEY
                          3 = score dichotomously, correct if X ≥ KEY
       36-40    INFLE   Alternative input file unit number
                          blank, 0 = Unit 5.
       41-45    LLIM    Alternative output file: start of identification field in
                        record.
       46-50    KLIM    Alternative output file: end of identification field in
                        record. If LLIM and KLIM are 1 and LREC the entire record
                        will be copied as the identification.
       51-55    NUFLE   Alternative output file logical unit number. For each valid
                        input record, a new record will be generated containing raw
                        score, scaled ability in logits and the identification field
                        defined by LLIM and KLIM.
       56-60    KPRTR   Control switch for optional output
                          blank, 0 = Print all plots
                          1 = Omit score histogram
                          2 = Omit fit plots
                          3 = Omit both
       61-65    KSIM    Print simulated persons if > 0.
       66-70    KDIFF   Punch item statistics on unit 7. Output consists of item
                        sequence number, item name, difficulty, discrimination and
                        total fit mean square. Format is (I3,A4,3F7.3).

3.   Item Name Card(s)  (20A4)
       A four character alphanumeric name for each of the "NITEM" items.

4.   Column Selection  (80A1)
       A record identical in size to each person's record indicating how the data
       in that position is to be used. For each position:
         blank, 0 = skip column
         1-9 = include item in corresponding column. Maximum allowable code is
               1-9 as given.
         A-Z = include item in corresponding column. Maximum allowable code is
               10-35 (A=10, etc.)
         &   = delete item in corresponding column after reading names.

5.   Scoring Key  (80A1)
       Corresponds to perfect input record. It must be included regardless of
       KSCOR.

6.   Options Label  (5A1)
       Identifies up to five option labels for which the number of occurrences
       will be counted for each item.

7.   Data cards

7a.  End of Data
       * in col. 1.

8.   Simulation header
       SIMULATE in columns 1-8 causes program to simulate data rather than read.
       If included it must be followed by

9.   Simulation task description card  (F5.0,I5,2F5.0,I5)

       CC       Label   Definition

       1-5      WIDTH   Range of item difficulties to be generated.
       6-10     ISUBJ   Number of persons to be generated.
       11-15    GMEAN   Mean ability of population sampled.
       16-20    SD      Standard deviation of ability.
       21-25    ISED    Seed for random number generator; should only be coded for
                        first generation in each run.

10.  End of job
       **** in columns 1-4
       Program will keep recycling looking for new problems until this card is
       encountered. As many jobs as desired may be stacked.
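
The layout of the item statistics record punched when KDIFF is set can be illustrated by reproducing the (I3,A4,3F7.3) format. The function name and the values below are hypothetical; only the field widths come from the card description.

```python
def punch_item_record(seq: int, name: str, difficulty: float,
                      discrimination: float, fit_mean_square: float) -> str:
    """Lay out one record as FORTRAN (I3,A4,3F7.3) would."""
    return (f"{seq:3d}{name:<4.4s}"
            f"{difficulty:7.3f}{discrimination:7.3f}{fit_mean_square:7.3f}")


# Hypothetical statistics for the first task.
record = punch_item_record(1, "35P", 0.512, 0.97, 1.23)
print(repr(record))
```

Each record is 28 columns wide, so it fits easily on an 80-column card.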


University of Chicago JCL

//jobname  JOB  (valid UC job card),RE=129K
//  EXEC  PGM=BICAL
//STEPLIB   DD  DSN=S2DD130.S05.DATA(BICAL),DISP=SHR
//FT01F001  DD  UNIT=SYSCR,DISP=NEW,SPACE=(TRK,(5,1))
//FTxxF001  DD  alternative input file description
//FTyyF001  DD  alternative output file description
//FT07F001  DD  SYSOUT=B,DCB=(RECFM=FB,BLKSIZE=80)
//FT06F001  DD  SYSOUT=A,DCB=(RECFM=FA,BLKSIZE=133)
//FT05F001  DD  *

The FT05 card is followed by the first job card. Cards FTxx, FTyy and
FT07 are not always required.

Include FTxx if input records are not on cards. The xx should
be replaced by the value of INFLE coded on the data
description card (cc 36-40).

Include FTyy if a new output file is to be produced. The yy should
be replaced by the value of NUFLE (cc 51-55).

Include FT07 if item data is to be punched (cc 66-70).


